Abstract


As smart products gradually enter our daily lives, computer vision, the "eyes" of
artificial intelligence, has become a leading topic of recent scientific research.
Within this field, multi-view 3D reconstruction is an important research subject that
has attracted considerable attention as artificial intelligence has grown in popularity.
Multi-view 3D reconstruction compensates for the disadvantages of traditional
modeling techniques, which rely on manual modeling software and idealized
mathematical and physical models. Much related work has been devoted to achieving
better 3D reconstruction of objects and even whole scenes. Unfortunately, most
methods still suffer from serious deficiencies in modeling accuracy, computing speed
and research cost. Because the accuracy and efficiency of 3D reconstruction depend
heavily on the imaging equipment and the surrounding environment, reconstruction is
overly sensitive to the quality of the input. Expensive image acquisition devices also
increase research costs. Furthermore, most existing 3D reconstruction algorithms lack
semantic information and therefore cannot automatically and selectively eliminate
unnecessary objects or interference during modeling, which also degrades
reconstruction performance. All of these issues have seriously restricted the further
development of 3D reconstruction applications.

The work in this paper therefore focuses on solving the above problems. We aim to
propose a multi-view 3D reconstruction algorithm with higher accuracy, higher
efficiency and rich semantic information. This research has theoretical and practical
significance for improving the performance of 3D reconstruction. Our contributions
are reflected in the following aspects:

(1) A SalientPatch-based blind deblurring algorithm is proposed as a pre-processing
step that reduces noise and thereby improves the quality of the input images for 3D
reconstruction.

Not all information in a blurred image helps the estimation of the blur kernel. We
have found that the regions that contribute to a high-quality blur kernel tend to lie in
the foreground of the image rather than in the background. Based on this finding, we
locate a SalientPatch for blur kernel estimation by computing object probability,
structural richness and local contrast. The proposed strategy is applicable to most
current maximum a posteriori (MAP) frameworks. Experiments show that the
proposed SalientPatch-based blind deblurring algorithm significantly improves
deblurring accuracy and computational efficiency. At the same time, it provides
high-quality image input for 3D reconstruction, which effectively alleviates the
sensitivity of 3D reconstruction to image sharpness.

(2) A hierarchical-convolution-based pedestrian detection algorithm is proposed as a
prior for stereo matching to reduce the interference of dynamic pedestrians in the 3D
reconstruction of real scenes.

To effectively solve the multi-scale and occlusion problems of pedestrian detection,
we propose a new fully connected convolutional neural network. Feature maps from
different convolutional layers are used to detect pedestrians of different scales, which
improves the detection accuracy for small-scale pedestrians. During the training
phase, a new prediction box that closely fits the pedestrian shape is proposed, which
effectively reduces the false detection rate and improves computational efficiency.
Finally, we optimize the SSD loss function to further improve the accuracy and
efficiency of pedestrian detection. Experimental results show that our pedestrian
detection algorithm effectively reduces the impact of dynamic pedestrians on the
stereo matching of scenes, which further enhances the accuracy and speed of 3D
reconstruction.

(3) A hybrid tree-guided PatchMatch and quantizing acceleration algorithm is
proposed as the core technology to achieve multi-view depth estimation with both
higher accuracy and higher efficiency.


Two independent algorithms (hybrid tree cost aggregation and PatchMatch stereo)
are seamlessly merged for multi-view stereo (MVS) matching while maintaining or
even improving the solution quality. Our algorithm not only significantly accelerates
PatchMatch estimation but also improves disparity accuracy from the one-pixel level
to the sub-pixel level. In addition, an effective quantizing acceleration strategy is
proposed that linearly interpolates the matching cost between the two closest
disparity values to make cost computation highly efficient. Experimental results show
that the proposed algorithm generates higher-quality depth images and is suitable for
multi-view reconstruction of real scenes.

(4) A unified variational formulation for joint depth map interpolation and
segmentation is proposed as the post-processing step of stereo matching to
up-sample the depth image.

We make full use of the complementary nature of segmentation and interpolation
and propose a new depth map up-sampling algorithm based on joint segmentation
and interpolation. We first decompose the depth map into multiple segments under
the guidance of the color image. We then interpolate the depth information from the
seed pixels of each segment, which solves the problem of edge blurring. At the same
time, we fit the original depth map with multiple planes and then estimate the
missing depth information, which significantly reduces the texture replication
problem. Experimental results show that the proposed depth map up-sampling
algorithm further improves both the accuracy and the visual quality of 3D
reconstruction.
Chapter 1
Introduction 


1.1 Research Background 


With the rapid development of computer vision (CV) technologies such as virtual
reality, augmented reality and mixed reality, these technologies have been widely
applied in artificial intelligence (AI) products, including urban planning and modeling,
autopilot, education and simulation, movie effects production, and battlefield and
terrain analysis. These applications achieve interaction and integration between the
real world and the virtual world by exploiting powerful computing and display
capabilities. In the future, there will be an even wider range of applications in
entertainment, military, medicine, education, construction, aerospace, shipbuilding,
justice, archaeology, industrial measurement and so on.

However, the objective world is three-dimensional, whereas ordinary cameras can
only acquire two-dimensional information from it. How to use computer technology
to automatically convert multi-view images into a 3D model therefore becomes the
key issue. Although traditional manual modeling software (such as 3D MAX, MAYA
and CAD) is improving day by day, building large-scale (for example city-level) and
complex 3D models remains a very time-consuming and labor-intensive task, so such
software can no longer meet the requirements of existing applications. Multi-view 3D
reconstruction technology was proposed to achieve realistic expression of, and
interaction with, a virtual 3D world through geometric figures and video or images.
It has since served as a core technique of intelligent automation applications and has
quickly become one of the hottest research topics in CV and AI. Multi-view 3D
reconstruction has a development history of more than 20 years, yet existing
algorithms are still not mature enough for practical applications. With the
development of intelligent applications, users from different fields have put forward
more stringent requirements on the accuracy and efficiency of 3D reconstruction,
which leaves existing algorithms with several bottleneck problems still to be solved.


Figure 1.1 3D Reconstruction Technologies

Notes: (a) 3D Modeling Software; (b) Laser Depth Scanning Device;
(c) Multi-View based 3D Reconstruction; (d) Binocular 3D Reconstruction of a Single Object.


Unlike traditional modeling software, multi-view 3D reconstruction technology can
automatically convert several images taken from different views into realistic 3D
models without manual drawing (results are shown in Figure 1.1 (c) and (d)). It can
significantly compensate for the deficiencies of traditional modeling and reduce the
workload. However, existing 3D reconstruction algorithms still cannot completely
replace the traditional methods. One of the biggest bottlenecks is that existing
methods cannot achieve high accuracy and high efficiency at the same time,
especially for large-scale scenes. The modeling accuracy of a multi-view 3D
reconstruction algorithm is sensitive to the objects and the surrounding environment
(texture, lighting, occlusion, motion blur and so on), so the high-complexity
algorithms needed to guarantee modeling quality usually consume too much time. In
addition, development costs are usually high, which makes the technology hard to
commercialize. Meanwhile, to obtain a 3D reconstruction that faithfully interprets the
real world, it is necessary to understand the structural relationships and semantic
information of the scene. Unfortunately, most algorithms ignore the importance of
semantics, which leads to serious distortion of the reconstructed model. All of the
above problems have seriously affected the modeling accuracy and computation
speed of multi-view 3D reconstruction and restricted the further development of 3D
reconstruction applications.

An effective solution is to first pre-process the input multi-view images before stereo
matching. Removing noise and enhancing image clarity weakens the dependence on
high-accuracy image acquisition devices. Secondly, a novel global stereo matching
algorithm and acceleration strategy should be carefully designed to achieve
multi-view depth estimation that is both accurate and fast. During depth estimation,
a target detection algorithm can provide accurate semantic information to the stereo
matching algorithm, so that moving targets are detected and dynamic interference is
removed in realistic complex scenes. This not only improves the accuracy of depth
estimation but also avoids unnecessary matching and thus speeds up the
computation. Based on this scheme, we choose blind image deblurring, multi-view
stereo matching, pedestrian detection and depth interpolation as the breakthrough
points. The main goal is to find out how to achieve accurate and efficient multi-view
3D reconstruction simultaneously for complex scenes or objects. Our study can
provide better technical support for future intelligent applications such as smart
cities, simulation and virtual reality. In the next section, we briefly review the status
of the relevant research and point out the challenges and the main research content
of this paper.
1.2 Related Work


1.2.1 Blind Image Deblurring 

Because of camera movement and environmental influences, acquired photos
inevitably suffer from blur and distortion. Image clarity is degraded by external
interference during photo acquisition, recording and storage, as well as by
low-quality imaging devices and transmission media. Images affected by such
external factors are collectively referred to as blurred images. Blurred images are
unacceptable in many applications that require high-quality input, such as license
plate recognition, surveillance monitoring, and medical diagnostics. We summarize
five types of image blur as follows: motion blur (horizontal movement), shake blur
(vertical movement), natural meteorological blur (smoke), intrinsic physical blur
(sensor quality) and defocus blur (imaging) [W114]. The goal of an image deblurring
algorithm is to automatically restore sharp images from degraded ones by using
image processing techniques. In the motion blur setting, a single blurred image is
captured while the camera lens and the scene move relative to each other during a
continuous exposure, and the blur kernel is assumed to be uniform and spatially
invariant. The process of image blur formation can therefore be described by a
convolution equation: $B = I \otimes k + n$, where B is a single motion-blurred
image, I denotes the clear image, k is the blur kernel (a.k.a. point spread function)
and n is external noise. Motion deblurring is thus a deconvolution problem in which
only the blurred image is available, and its main task is to estimate the blur kernel.
Once the blur kernel has been estimated, the deblurring problem becomes a
non-blind deconvolution problem. In other words, whether the blur kernel is
unknown or known determines whether the problem is blind or non-blind
deblurring. [LWDF09] also pointed out that the whole deblurring process should first
estimate the blur kernel and then restore the clear image. Because the blind
deconvolution problem is highly difficult, the kernel estimation step takes up most of
the time in a deblurring algorithm. In addition, the accuracy of the blur kernel
estimation determines the overall performance of the subsequent non-blind
deblurring step. Therefore, the kernel estimation algorithm plays a crucial role in
every blind deblurring algorithm, because it directly determines the accuracy and
efficiency of image deblurring. The running speed of a kernel estimation algorithm
depends on many factors, including the calculation method, the kernel size, and the
size of the image used for the estimation.
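The blur formation model $B = I \otimes k + n$ can be illustrated with a small sketch. The code below is only a forward simulation under stated assumptions (a spatially invariant kernel, output restricted to the region where the kernel fits entirely inside the image, optional Gaussian noise); the function name `blur_image` and the test data are hypothetical and are not the deblurring method studied in this paper.

```python
import numpy as np


def blur_image(sharp, kernel, noise_sigma=0.0, seed=0):
    """Simulate B = I (*) k + n for a spatially invariant kernel.

    Only the 'valid' region is returned, so the output is smaller than the input.
    """
    kh, kw = kernel.shape
    out_h = sharp.shape[0] - kh + 1
    out_w = sharp.shape[1] - kw + 1
    flipped = kernel[::-1, ::-1]  # convolution (not correlation) flips the kernel
    blurred = np.zeros((out_h, out_w))
    # Accumulate shifted copies of the image, one per kernel tap.
    for dy in range(kh):
        for dx in range(kw):
            blurred += flipped[dy, dx] * sharp[dy:dy + out_h, dx:dx + out_w]
    if noise_sigma > 0.0:
        rng = np.random.default_rng(seed)
        blurred = blurred + rng.normal(0.0, noise_sigma, size=blurred.shape)
    return blurred


# Hypothetical test data: a one-pixel-wide vertical line and a
# 5-pixel horizontal box kernel (a simple horizontal motion blur).
sharp = np.zeros((9, 9))
sharp[:, 4] = 1.0
kernel = np.full((1, 5), 1.0 / 5.0)

blurred = blur_image(sharp, kernel)
print(blurred.shape)
print(blurred[0])
```

In this noise-free example the line is smeared horizontally, so every output pixel equals 0.2. Blind deblurring is the inverse task: recovering both `kernel` and `sharp` from `blurred` alone.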

Research on restoring a single motion-blurred image has a long history. In the past
decade, many excellent theories have been proposed for the blind deblurring
problem. Typically, blind deblurring methods rely on natural image priors or
additional image observations. These methods are collectively referred to as
Maximum A Posteriori (MAP) [CW98] algorithms; they share the view that natural
images have Gaussian features and that the blurring process causes deviations from
natural image features. Because the blind deblurring problem is severely
under-constrained, the most direct way for existing algorithms to improve the
accuracy of blur kernel estimation is to use every pixel in the input image. Most
current methods do estimate blur kernels from the entire image, but this is usually
very time consuming. It is worth noting that not all pixels or sub-windows of the
input blurred image provide useful information for kernel estimation. As [FSH06],
[LWDF09], [BFC12] and [XZJ13] showed, image regions with significant gradients in
blurred images play an important role in kernel estimation, especially for methods
based on salient edges. On the other hand, [XJ10] demonstrated that if the size of a
blurred object is smaller than the size of the blur kernel, its edge information may
harm the accuracy of the kernel estimate. In addition, smooth areas usually do not
change after the blurring process and easily lead to an erroneous delta-function
kernel. Therefore, a full-image estimation strategy does not effectively improve the
accuracy of the blur kernel, because the adverse effects of smooth areas and fine
edges can reduce the overall estimation accuracy while increasing the computational
cost. In summary, for large blurred images, most current blind deblurring algorithms
still cannot achieve high kernel estimation accuracy and fast computation at the
same time.

1.2.2 3D Reconstruction

3D reconstruction is a comprehensive and complex multidisciplinary research field
that draws on computer vision, computer graphics, pattern recognition, geometry,
physics and other disciplines. Traditional 3D modeling uses modeling software to
generate 3D geometric models of objects with computer-aided drawing techniques;
the geometric shapes are represented by curved surfaces with mathematical
expressions. In contrast, 3D reconstruction can be defined as establishing
mathematical models of real objects that are suitable for computer representation
and processing. It is a key computer technology for constructing virtual reality and
expressing the objective world. It can be summarized in the following steps:

(1) Image acquisition 

Before 3D reconstruction, a camera is required to acquire 2D images of an object
from different views. Because the lighting conditions and the geometric
characteristics of the camera have a great impact on image processing, pre-processing
is necessary to reduce the influence of the environment on the input images.

(2) Camera calibration 

Camera calibration establishes an effective imaging model and calculates the internal
and external parameters of the acquisition device. The 3D coordinates of points in
space can then be obtained by combining the matching results with the camera
parameters, which achieves the purpose of 3D reconstruction. Camera calibration is
one of the key steps in 3D reconstruction, and the quality of calibration directly
affects the overall modeling accuracy.

(3) Stereo matching 

Stereo matching is the core algorithm of 3D reconstruction. It finds the one-to-one
corresponding pixels in two or more views. Stereo matching constructs and minimizes
an energy cost function for disparity estimation in order to calculate the depth of
each pixel. Because of interfering factors in the scene, such as lighting conditions,
noise, complex object geometry, surface physical properties, camera characteristics
and many other environmental factors, existing stereo matching techniques still
cannot provide satisfactory depth estimation results. How to improve the accuracy of
depth estimation has therefore always been a bottleneck problem in computer vision.
This step usually also includes post-processing to further improve the accuracy and
visual quality of the depth map.

(4) 3D point cloud reconstruction 

After stereo matching, the point cloud of the scene or objects can be recovered by
combining the matching results with the calibrated internal and external camera
parameters. Clearly, the accuracy of the reconstructed three-dimensional point cloud
is determined by the accuracy of stereo matching and camera calibration. In its study
of 3D reconstruction, this paper takes the two basic issues of camera tracking and
depth estimation as its breakthrough points. We mainly focus on how to recover 3D
geometry and motion information from real-life image data and reuse it.

Below, we briefly introduce the research background related to this study:

1. Structure from Motion Techniques 

First, the camera's motion parameters must be recovered before a model of an object
or scene can be reconstructed. Structure from motion (SfM) is the most commonly
used approach to camera calibration; it infers the camera parameters by analyzing
three-dimensional structure from motion. Assuming that the object is stationary and
only the camera moves, SfM can recover the camera parameters and the
three-dimensional structure of the object from the image sequence captured by the
camera. It is a classic problem that has been studied for decades in computer vision.
The theoretical framework of SfM is quite broad and involves feature point matching
and tracking, calibration of internal and external camera parameters, and so on. After
years of effort, researchers have produced a large body of work in this area
[TMHF99, Cor01, GCH03, PVGV04, SSS06], which has brought SfM technology close to
maturity. Here we briefly introduce the camera models, perspective projections and
Euclidean transformations involved in this study, in order to understand the basic
ideas of 3D reconstruction more deeply in mathematical terms.

(1) Camera Model 

A camera uses a convex lens and exposure to project 3D points of the real world onto
a 2D plane (the photo). This projection can be expressed as a chain of coordinate
transformations, which is called the camera imaging model. In computer vision,
camera models explain the connection between points in space and image pixels.
Existing camera models can be divided into linear models (pinhole models) and
nonlinear models. As modern lens production processes have become more
sophisticated, lens distortion is essentially negligible. To reduce algorithmic
complexity, most 3D reconstruction studies adopt the ideal pinhole camera model,
which describes the conversion between the world coordinate system, the camera
coordinate system, the image coordinate system and the pixel coordinate system. In
the pinhole model, the optical center acts as a small hole, and light is assumed to
travel in straight lines.

Let a point in three-dimensional space be (X, Y, Z) and let its projection onto the
two-dimensional plane be (x, y), as shown in Figure 1.2. The optical axis passes
through the camera center and is perpendicular to the projection plane. The
intersection of the optical axis and the projection plane is denoted O. The distance
from the camera center to the projection plane is denoted f (the focal length). By
similar triangles, the projection can be written as:

$$x = f\frac{X}{Z}, \qquad y = f\frac{Y}{Z} \tag{1.1}$$
Figure 1.2 Perspective Projection

Notes: The relationship between the camera, image and world coordinate systems.


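For convenience, we convert the above formula into homogeneous coordinates. The
perspective projection can then be written as a linear transformation:

$$\begin{pmatrix} x \\ y \\ f \end{pmatrix} \sim \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} \tag{1.2}$$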
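As a quick numerical check, the sketch below projects one point with the similar-triangle form and with the homogeneous linear form and confirms that both give the same image-plane coordinates. The function names, the sample point and the focal length are hypothetical values chosen only for illustration.

```python
import numpy as np


def project_similar_triangles(point, f):
    """Perspective projection from similar triangles: x = f X / Z, y = f Y / Z."""
    X, Y, Z = point
    return np.array([f * X / Z, f * Y / Z])


def project_homogeneous(point, f):
    """Same projection via homogeneous coordinates: (x, y, f) ~ [I | 0] (X, Y, Z, 1)."""
    P0 = np.hstack([np.eye(3), np.zeros((3, 1))])  # the 3x4 matrix [I | 0]
    Xh = np.append(point, 1.0)                     # homogeneous 3D point
    xh = P0 @ Xh                                   # equals (X, Y, Z), i.e. (x, y, f) up to scale
    return f * xh[:2] / xh[2]                      # rescale so the third component is f


# Hypothetical example values (metres): a point 4 m in front of the camera.
point = np.array([2.0, 1.0, 4.0])
f = 0.05

xy_triangles = project_similar_triangles(point, f)
xy_homogeneous = project_homogeneous(point, f)
print(xy_triangles, xy_homogeneous)
```

The homogeneous form is convenient because the later coordinate changes (intrinsics, rotation, translation) can be chained as matrix products.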


In general, the image coordinates are not the same as the coordinates on the
projection plane. Their relationship depends on the size and shape of the pixels and
on the position of the optical center. Therefore, there is a conversion between the
image coordinate system and the pixel coordinate system, which can be expressed by
the following equation:


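$$\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \sim \begin{pmatrix} a_x & s & x_0 \\ 0 & a_y & y_0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} X_{cam} \\ Y_{cam} \\ Z_{cam} \end{pmatrix} = K \begin{pmatrix} X_{cam} \\ Y_{cam} \\ Z_{cam} \end{pmatrix} \tag{1.3}$$

where K is a 3 x 3 upper triangular matrix called the camera internal parameter
matrix. It is defined as follows:

$$K = \begin{pmatrix} a_x & s & x_0 \\ 0 & a_y & y_0 \\ 0 & 0 & 1 \end{pmatrix} \tag{1.4}$$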


The internal parameter matrix of the camera contains five parameters. The
projection center $(x_0, y_0)$ is the intersection of the optical axis and the image
plane. $a_x$ and $a_y$ are the scaling factors in the x and y directions, respectively,
and s is the skew (tilt) factor. The ratio between $a_x$ and $a_y$ is also called the
aspect ratio.

Since the camera can be placed anywhere in the environment, a reference coordinate
system must be selected to describe the positions of the camera and the objects in
space. This coordinate system is called the world coordinate system and consists of
three coordinate axes: $X_w$, $Y_w$ and $Z_w$. The relationship between the
camera coordinate system and the world coordinate system is described by a rotation
matrix R and a translation vector T. The homogeneous coordinates of a point P in the
world coordinate system and in the camera coordinate system are defined as
$P = (X_w, Y_w, Z_w, 1)^T$ and $x = (X_{cam}, Y_{cam}, Z_{cam}, 1)^T$, respectively.
The two coordinate systems are related by a Euclidean transformation as below:

$$\begin{pmatrix} X_{cam} \\ Y_{cam} \\ Z_{cam} \\ 1 \end{pmatrix} = \begin{pmatrix} R & T \\ 0^T & 1 \end{pmatrix} \begin{pmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{pmatrix} \tag{1.5}$$
where R represents the rotation between the camera and world coordinate systems
and is a 3 x 3 orthogonal matrix, and T is a 3 x 1 column vector that represents the
translation between the camera and world coordinate systems. R and T jointly
describe where the camera sits relative to the world coordinate system, so they are
called the external parameters (or external motion parameters) of the camera. Using
the above transformations, we can finally map the coordinates of a point P in the
world coordinate system to its projection point p in the image (the conversion
between world coordinates and pixel coordinates):

$$z \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = K \, [R \mid T] \begin{pmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{pmatrix} = M P \tag{1.6}$$

From the above equation, it is clear that even if the position of the image point p is
known, the spatial point $(X_w, Y_w, Z_w)$ remains undetermined, even when the
internal and external parameters of the camera are all known. Since M is a 3 x 4
non-invertible matrix, we can only obtain two linear equations in $X_w$, $Y_w$ and
$Z_w$ by eliminating z when M and (x, y) are known. These two linear equations
define the ray OP. In other words, all spatial points whose projection is p lie on this
ray, as shown in Figure 1.3. When the image point p is known, any space point P on
the ray OP is projected onto the same image point p by the pinhole imaging model,
so the space point cannot be uniquely determined. The purpose of 3D reconstruction
is therefore to use the limited information provided by the images to find and
determine the spatial points corresponding to the image pixels.
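The depth ambiguity described above can be checked numerically. The following sketch assumes the convention $X_{cam} = R X_w + T$ and a projection matrix $M = K[R \mid T]$; all numeric values and function names are hypothetical. It projects a world point, back-projects the resulting pixel to a ray through the camera center, and shows that several points on that ray map to the same pixel.

```python
import numpy as np


def projection_matrix(K, R, T):
    """Build the 3x4 matrix M = K [R | T] (assumes X_cam = R X_w + T)."""
    return K @ np.hstack([R, T.reshape(3, 1)])


def project(M, Pw):
    """Project a 3D world point to pixel coordinates by eliminating the scale z."""
    ph = M @ np.append(Pw, 1.0)
    return ph[:2] / ph[2]


def backproject_ray(K, R, T, pixel):
    """Return the camera center and unit direction (world frame) of the ray through a pixel."""
    center = -R.T @ T  # solves R C + T = 0
    direction = R.T @ np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])
    return center, direction / np.linalg.norm(direction)


# Hypothetical camera: 800 px focal scale, principal point (320, 240), zero skew.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
angle = np.deg2rad(10.0)
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
T = np.array([0.1, -0.2, 0.5])

M = projection_matrix(K, R, T)
Pw = np.array([0.3, -0.1, 4.0])
p = project(M, Pw)

# Every point on the ray through p projects back to p.
center, direction = backproject_ray(K, R, T, p)
points_on_ray = [center + depth * direction for depth in (1.0, 2.5, 10.0)]
reprojected = [project(M, q) for q in points_on_ray]
print(p, reprojected)
```

A single view therefore fixes only the ray; a second view (or more) is needed to pick out the point on it.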

(2) Binocular Geometry 

Multi-view 3D reconstruction theory indicates that the three-dimensional geometric
correspondence between two images can be calculated once the camera model is
known. First, we need to determine whether a point in one image constrains the
position of its corresponding point in the other image. Figure 1.3 (a) shows two
images of the same tree taken from two different views. The point corresponding to
the red point on the treetop in the left image must lie on a straight line (yellow) in the
right image. This is because a ray extends from the camera center through the red
point in the image, and the three-dimensional position of this point must lie on this
ray, as shown in Figure 1.3 (b).


Figure 1.3 Epipolar Geometry

Notes: Panel (b) labels the epipolar plane, the epipolar line and the epipole.


In epipolar geometry, the projection of this ray onto the other image is a
two-dimensional line called the epipolar line of the point. Let the centers of the two
cameras be $O_l$ and $O_r$; the line between them is called the baseline. The
intersections of the baseline with the two image planes are called the epipoles,
denoted $e_l$ and $e_r$. All epipolar lines in an image pass through its epipole.
Therefore, if the internal and external parameters of the cameras are known, all
epipoles and epipolar lines are completely determined. In fact, the epipolar geometry
can be obtained without fully determining the camera parameters; it is encoded by a
3 x 3 fundamental matrix F. Two corresponding points $p_l$ and $p_r$ in the
binocular views satisfy the following epipolar constraint:

$$\tilde{p}_r^{T} F \tilde{p}_l = 0 \tag{1.7}$$


Here, $\tilde{p}_l$ and $\tilde{p}_r$ denote the homogeneous coordinates of $p_l$
and $p_r$, respectively. The relationship between the fundamental matrix and the
camera parameters is $F = K^{-T} [T]_{\times} R K^{-1}$, where $[T]_{\times} R$ is
called the essential matrix. For binocular views, once the fundamental matrix F has
been calculated, the first image is used as the reference image, and the projection
matrices of the two images can be set as follows:


$$P_1 = [\, I_{3\times3} \mid 0_{3\times1} \,], \qquad P_2 = [\, [e_r]_{\times} F \mid e_r \,] \tag{1.8}$$


However, there are some uncertainties in the solution of binocular camera 
parameters. The solution stability is poor especially in the case of unknown internal 
parameters. Even if the projection error is small, the recovered projection matrix may 
still be very different from the true solution. Therefore, we generally use multi-view 
geometric relations to calibrate the camera, which is much more reliable than 
binocular views. 
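The epipolar constraint can be verified on synthetic data. The sketch below assumes both cameras share the same internal matrix K, that the left camera frame is the reference, and that right-camera coordinates are obtained as $X_r = R X_l + T$; under these assumptions $F = K^{-T} [T]_{\times} R K^{-1}$. The numeric values and helper names are hypothetical.

```python
import numpy as np


def skew(t):
    """Return [t]_x, the matrix such that skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])


def fundamental_from_pose(K, R, T):
    """F = K^{-T} [T]_x R K^{-1}, assuming X_r = R X_l + T and a shared K."""
    K_inv = np.linalg.inv(K)
    return K_inv.T @ skew(T) @ R @ K_inv


def to_pixel(K, X_cam):
    """Homogeneous pixel coordinates (x, y, 1) of a point given in camera coordinates."""
    p = K @ X_cam
    return p / p[2]


# Hypothetical stereo rig: small rotation about the y axis and a mostly horizontal baseline.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
angle = np.deg2rad(5.0)
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
T = np.array([-0.2, 0.0, 0.01])

X_left = np.array([0.4, -0.3, 5.0])   # scene point in left-camera coordinates
X_right = R @ X_left + T              # same point in right-camera coordinates

p_l = to_pixel(K, X_left)
p_r = to_pixel(K, X_right)

F = fundamental_from_pose(K, R, T)
residual = float(p_r @ F @ p_l)       # epipolar constraint, should vanish

# Epipolar line of p_l in the right image and the distance of p_r from it (pixels).
line_r = F @ p_l
distance = abs(line_r @ p_r) / np.hypot(line_r[0], line_r[1])
print(residual, distance)
```

With exact correspondences the residual and the point-to-line distance vanish up to rounding; with real, noisy matches the distance is a common measure for rejecting outliers.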

(3) Multi-view Geometry and Self-Calibration 

Before SfM can be used for camera calibration, the matching relationship between
two-dimensional feature points across the images must be obtained. Many feature
point matching methods exist, such as Harris corner detection [Ste88] and the KLT
matching method [LK81, ST02]. This type of method is better suited to situations
where the camera motion between views is small, such as matching between
consecutive video frames. Another class of feature point matching methods is the
invariant-feature-based methods that have emerged in recent years [Low04, WB07].
These methods can still stably extract and match feature points when the camera
undergoes large displacements and rotations or even zooms, so they are better suited
to applications such as panoramic image matching and matching under large
viewpoint changes. Video sequence matching often exploits the coherence between
consecutive frames to match frame by frame: feature points are extracted first, and
their matches are then searched for in the next frame. Using the epipolar geometry
constraint [Zha98], outliers can be eliminated by the RANSAC method [FB87].
Matched feature points across the video sequence are linked to form m feature
tracks, denoted $\mathcal{X} = \{\chi_j \mid j = 1, \ldots, m\}$, where $x_{ij}$ is the
two-dimensional position of the j-th track on the i-th image. Each feature track
corresponds to a 3D point in the scene.


Assume that we track m feature tracks $\{\chi_j\}_{j=1}^{m}$, which correspond to
m 3D points $\{X_j\}_{j=1}^{m}$ in space. The projection of any 3D point j on the
i-th image can be expressed as:

$$x_{ij} = \pi(P_i X_j) \tag{1.9}$$

where the projection function is $\pi(x, y, z) = (x/z, y/z)$ and
$P_i = K_i [R_i \mid T_i]$ is the 3 x 4 projection matrix of the i-th image. $x_{ij}$ is
the 2D projection of the 3D point $X_j$ on the i-th image. In practice, not every 3D
point has a corresponding 2D point in every image because of occlusion. We therefore
define a validity matrix $W = \{w_{ij} \mid i = 1, \ldots, n;\ j = 1, \ldots, m\}$ to
describe this correspondence: $w_{ij} = 1$ when the 3D point $X_j$ is visible on the
i-th image, and $w_{ij} = 0$ otherwise. Because of noise, Equation (1.9) generally
does not hold exactly. A least squares formulation is therefore usually used to
construct the following optimization objective function:

MEX | — Xj 


Ce Le BEDS} (1.10) 


i=l j=l 


By minimizing the objective function E, the projection matrix of each image and
the spatial position of each three-dimensional point are obtained. Minimizing the
objective in Equation (1.10) is known as bundle adjustment [TMHF99], which
optimizes the camera parameters and the positions of the 3D points jointly as
variables. In general, this problem is highly complex and has no closed-form
solution, so iterative optimization is usually used to solve it. Choosing a good
initial value is therefore one of the important steps. To this end, researchers
have proposed a variety of SfM methods and achieved good-quality camera
calibration. However, early SfM methods often imposed strong constraints: for
example, marker points had to be placed in the scene [KB02], the scene was assumed
to contain planes [AS00], or the camera position was fixed so that only pure
rotation was allowed [Har94].
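
As a minimal sketch of how the objective in Equation (1.10) can be evaluated, the following Python code computes the weighted reprojection error with NumPy. The function names, the array layout, and the toy cameras in the usage lines are illustrative assumptions and are not taken from the thesis. An actual bundle adjustment would pass this quantity (or its residuals) to an iterative non-linear least-squares solver.

```python
import numpy as np

def project(P, X):
    """Apply pi(P X): project 3D points X of shape (n, 3) with a 3x4 matrix P."""
    X_h = np.hstack([X, np.ones((X.shape[0], 1))])  # homogeneous coordinates
    x = X_h @ P.T                                    # each row is (x, y, z)
    return x[:, :2] / x[:, 2:3]                      # perspective division

def reprojection_error(Ps, Xs, obs, W):
    """Evaluate E = sum_i sum_j w_ij * ||pi(P_i X_j) - x_ij||^2.

    Ps:  (m, 3, 4) projection matrices, one per image
    Xs:  (n, 3)    3D points
    obs: (m, n, 2) observed 2D positions x_ij (any finite value where w_ij = 0)
    W:   (m, n)    visibility matrix with entries 0 or 1
    """
    E = 0.0
    for P, x_obs, w in zip(Ps, obs, W):
        residual = project(P, Xs) - x_obs
        E += float(np.sum(w * np.sum(residual ** 2, axis=1)))
    return E

# Toy usage with two hypothetical cameras that differ by a small translation.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P1 = K @ np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])
Ps = np.stack([P0, P1])
Xs = np.array([[0.0, 0.0, 2.0], [0.5, -0.2, 3.0]])
obs = np.stack([project(P, Xs) for P in Ps])  # noise-free observations
W = np.ones((2, 2))
print(reprojection_error(Ps, Xs, obs, W))     # noise-free input, so E is 0.0
```

Setting an entry of W to zero removes the corresponding observation from the sum, which is how occluded points are excluded.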

As camera calibration algorithms have developed, methods that accurately handle
freely moving cameras have become available. Traditional SfM algorithms usually
compute the parameters from two frames [Zha98] or three frames [AS98, SW95],
initializing the three-dimensional structure and camera motion parameters in
projective space through algebraic operations. The parameters are then upgraded to
metric space through self-calibration techniques [Tri97, PKVG98] to achieve a
metric reconstruction. Self-calibration is the process of automatically computing
the camera intrinsic parameters from the two-dimensional information in the images.
Early self-calibration techniques could only deal with cases in which the focal
length is unknown but fixed [FLM92, Har93, Tri97]. More recently, self-calibration
methods capable of handling a varying focal length [HA97, PKVG98, PVGV04] have
been proposed. With the publication of several classic papers and works
[PVGV04, FZ03, HZ08], the theory of SfM has become increasingly mature, and
commercial software has been developed and successfully applied to advertising
design and film production.

2. Stereo Matching Techniques 

Human beings perceive the three-dimensional world through the cooperation of the
two eyes, or more precisely through the joint action of the eyes and the brain.
When both eyes focus on the same object, the viewing angle of each eye is slightly
different, which results in small differences between the two images projected on
the retinas. As shown in Figure 1.4, the same physical point in the scene is
projected to the point $p_l$ in the left view and the point $p_r$ in the right
view. The positions of these two points differ because of the different viewing
angles, and this small difference is called disparity. Disparity and depth are
inversely related.


[Figure 1.4: binocular stereo geometry with baseline B. Disparity = x_l - x_r; depth is inversely proportional to disparity.]

Figure 1.4 Binocular Stereo
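
The inverse relationship between disparity and depth can be illustrated with the standard relation for a rectified pinhole stereo pair, Z = f * B / d. Only the baseline B appears in the surviving figure labels, so the focal length f (in pixels) and the function name below are assumptions of this sketch rather than the thesis notation.

```python
def depth_from_disparity(x_left, x_right, focal_px, baseline):
    """Depth of a point in a rectified stereo pair, using Z = f * B / d.

    x_left, x_right: horizontal image coordinates of the same scene point (pixels)
    focal_px:        focal length in pixels (assumed known from calibration)
    baseline:        distance B between the two camera centers (e.g. in meters)
    """
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_px * baseline / disparity

# Halving the disparity doubles the estimated depth.
near = depth_from_disparity(120.0, 100.0, focal_px=800.0, baseline=0.1)
far = depth_from_disparity(110.0, 100.0, focal_px=800.0, baseline=0.1)
print(near, far)
```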


Our brain automatically converts disparity into depth information, which allows us
to perceive solid objects in real space with ease. Stereo matching algorithms are
designed to recover depth information from images automatically by imitating this
mechanism with computer vision techniques. Stereo matching is the core algorithm in
3D reconstruction and directly determines the accuracy and running efficiency of
the entire reconstruction system. Scharstein et al. [SSZ02] carried out a detailed
classification and evaluation of well-performing stereo matching algorithms that
generate dense disparity maps from binocular images. They pointed out that
existing stereo matching algorithms usually consist of four independent modules:
matching cost calculation, cost aggregation, disparity estimation, and disparity
refinement. Existing algorithms can also be divided into local algorithms and
global algorithms.

Local stereo matching algorithms typically use a window search to find
corresponding points between the two views and estimate disparity. Adjacent
windows are treated independently, and all pixels within a window are assumed to
share the same depth. Many classic stereo matching methods are based on window
search strategies. For example, Xing et al. [LZLY14] used a rotation skeleton based
region (RSBR) to estimate disparity, developing adaptive local regions that give
more accurate support regions for depth estimation. However, this algorithm does
not provide highly robust disparity estimation, especially in weakly textured and
occluded areas. Cao et al. [PNJ13] proposed a local stereo matching algorithm based
on information permeability. Their experimental results show that it significantly
improves disparity consistency compared with other traditional methods. Jiao et al.
[DNNJ12] proposed a further optimized local stereo matching algorithm that
integrates non-parametric transforms and edge-preserving filters to improve
disparity estimation and stereo matching performance. Their method estimates
disparity more stably under different lighting conditions and with different
devices. Xia et al. [XYJW13] proposed an effective local stereo matching method
based on extended triangular interpolation. They use a triangle mesh to represent
the image and propose a stereo matching algorithm based on a Bayesian model to
achieve dense disparity estimation. It is worth mentioning that some recent local
stereo matching methods have achieved performance comparable to global methods.
Jiao et al. [JWW14] showed that the performance of their algorithm is comparable to
that of global approaches by using a new combined cost function and a secondary
disparity refinement to eliminate residual outliers in the disparity maps. The
advantages of local stereo matching are low complexity and fast processing.
However, because the global relationship between pixels is not considered, most
local algorithms are still overly sensitive to noise, textureless regions,
structural discontinuities, and occlusions, which limits the accuracy of depth
estimation.
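
The window-search idea behind local methods can be sketched with a basic block-matching routine. This is a generic textbook baseline (sum of absolute differences over a square window, followed by winner-take-all selection), not any of the cited algorithms; the function names and the default window radius are assumptions of the sketch. It uses the convention Disparity = x_l - x_r, with the left image as reference.

```python
import numpy as np

def box_sum(img, radius):
    """Sum of img over a (2r+1) x (2r+1) window around each pixel (edge-padded)."""
    k = 2 * radius + 1
    padded = np.pad(img, radius, mode="edge")
    # Integral image with a leading row and column of zeros.
    ii = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    ii[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    return ii[k:, k:] - ii[:-k, k:] - ii[k:, :-k] + ii[:-k, :-k]

def block_matching(left, right, max_disp, radius=2):
    """Winner-take-all disparity map for a rectified grayscale image pair."""
    left = np.asarray(left, dtype=np.float64)
    right = np.asarray(right, dtype=np.float64)
    h, w = left.shape
    cost = np.full((max_disp + 1, h, w), np.inf)
    for d in range(max_disp + 1):
        # Pixel (y, x) in the left view is compared with (y, x - d) in the right view.
        diff = np.abs(left[:, d:] - right[:, :w - d])
        cost[d, :, d:] = box_sum(diff, radius)  # aggregate the matching cost
    return np.argmin(cost, axis=0)              # cheapest disparity per pixel

# Synthetic check: the left view is the right view shifted by 3 pixels.
rng = np.random.default_rng(0)
right = rng.random((20, 30))
left = np.roll(right, 3, axis=1)
disp = block_matching(left, right, max_disp=5)
print(np.unique(disp[:, 3:]))
```

Every pixel in a window votes for the same disparity, which is exactly the constant-depth assumption that makes such methods fragile at depth discontinuities and in textureless regions.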

In contrast to local algorithms, global stereo matching methods assume that
neighboring pixels are likely to have similar depth and impose smoothness
constraints on the depth of adjacent pixels. Depth estimation is thereby converted
into an energy minimization problem. Several notable works have been proposed
recently. Atlanta et al. [AOK15] proposed a novel segment-based stereo matching
algorithm. A fast method with an edge-preserving property is applied for the
initial disparity map estimation, which supports a global graph-cut optimization of
the disparity planes. The experimental results showed that disparity accuracy is
improved at the expense of running speed. Yang [Yan12] studied the cost aggregation
problem in depth and first proposed a non-local stereo matching algorithm, which
aggregates matching costs adaptively based on pixel similarity in a tree structure.
Experiments showed that this solution is superior to all other local cost
aggregation methods on the standard Middlebury benchmark. Later, Yang [Yan15]
further optimized the non-local method so that every node receives support from all
other nodes on the tree, which reduces the complexity to linear. The great
advantage of Yang's methods is their extremely low computational complexity.
However, because the disparity accuracy is poor, the quality of the estimated depth
is still unacceptable. Chen [CAWC13] introduced a novel weighting function that
takes the color cues and boundary cues of the reference color image into account.
The experimental results show that the method improves the efficiency of stereo
matching, but the accuracy of disparity estimation is low. Huang [HCZ15] proposed a
fast non-local disparity refinement method based on disparity confidence
propagation. Its evaluation on the standard Middlebury benchmark shows that it
achieves higher-precision disparity estimation.
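
For orientation, the energy minimized by global methods is commonly written in the following generic form, as in the taxonomy of [SSZ02]. The symbols are assumptions of this sketch and are not the notation of any specific cited algorithm: $d_p$ is the disparity assigned to pixel $p$, $C$ is the matching cost, $V$ is a smoothness penalty over neighboring pixel pairs $\mathcal{N}$, and $\lambda$ balances the two terms.

```latex
E(d) = \sum_{p} C\bigl(p, d_p\bigr)
     + \lambda \sum_{(p,q) \in \mathcal{N}} V\bigl(d_p, d_q\bigr)
```

The first term alone corresponds to a purely local, per-pixel decision; the second term couples neighboring pixels, which is what makes the optimization global and typically more expensive.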

Unfortunately, even though global stereo matching algorithms can provide higher
disparity accuracy than local methods, most of the above algorithms provide only
pixel-level disparity precision. Moreover, because the energy functions are complex
and slow to converge, global algorithms are often slow. In summary, stereo matching
has a long research history, during which many attempts have been made to improve
its performance. Nevertheless, the current state of the art still has many defects
and bottlenecks that await better solutions.

1.2.3. Target Detection 

Target detection algorithms are mainly used to automatically locate and identify
objects in images or videos and to distinguish the category of each target by
means of computer vision techniques. They are widely used in smart driving, drones,
and artificial intelligence applications. This paper mainly studies target
detection in static images. Traditional target detection methods can generally be
divided into three phases:

(1) candidate region selection;

(2) feature extraction;

(3) object classification.

Traditional target detection algorithms mainly use hand-crafted features to detect
the target. For example, Navneet Dalal and Bill Triggs [DT05a] extracted human body
features with motion information by using the histogram of oriented gradients (HOG)
descriptor. However, computing HOG slows down the detector; the algorithm is also
sensitive to ambient noise and has difficulty with occlusion. Ojala et al. [OPM02]
designed a simple texture operator, the local binary pattern, which labels the
pixels of an image by thresholding the neighborhood of each pixel and treating the
result as a binary number. The algorithm is simpler and more efficient than
traditional algorithms based on hand-crafted features. Unfortunately, because it
considers only the local texture features of the image, its detection accuracy is
greatly affected. Dollar et al. [DABP14] simply smooth and downsample the pixel
channels and then extract aggregate channel features by single-pixel lookups.
The biggest advantage of this algorithm is that it simplifies feature extraction,
which significantly increases detection speed and reduces the size of the feature
pool. However, this feature extraction strategy cannot accurately detect
small-scale pedestrians. Rodrigo et al. [BOHS14] carefully evaluated more than 40
target detection algorithms on the Caltech benchmark and analyzed the research
directions of pedestrian detection over the past decade. After extensive
experiments, they exploited the complementarity of the various techniques and
proposed the Katamari-v1 (SquaresChn-Ftrs + DCT + SDt + 2Ped) algorithm. They
achieved a 22.49% miss rate on the Caltech dataset by using the HOG + Flow feature.
From the above description, traditional target detection has two main problems:

(1) The sliding-window region selection strategy is not targeted, which leads to
high time complexity and redundant windows;

(2) Hand-designed features are not robust to the diversity of real natural
objects.

However, Rodrigo's work primarily sacrificed computing efficiency in order to
integrate a variety of related methods with multiple functions and improve
performance.
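
As an illustration of the hand-crafted features discussed above, the local binary pattern described for [OPM02] can be sketched in its basic 3 x 3 form: each of the eight neighbors is thresholded against the center pixel and the resulting bits are packed into one byte. The bit ordering (clockwise from the top-left neighbor) and the function name are assumptions of this sketch; published variants differ in neighborhood shape and ordering.

```python
import numpy as np

def lbp_3x3(img):
    """Basic 3x3 local binary pattern code for every interior pixel."""
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    center = img[1:-1, 1:-1]
    # Neighbor offsets (dy, dx), clockwise starting at the top-left neighbor.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Threshold the neighbor against the center and store the result as one bit.
        codes |= (neighbor >= center).astype(np.uint8) << bit
    return codes

flat = np.full((5, 5), 7.0)
print(lbp_3x3(flat))  # every neighbor equals the center, so every code is 255
```

Because each code depends only on a 3 x 3 neighborhood, the descriptor is cheap to compute but captures only local texture, which matches the limitation noted in the text.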

Geoffrey Hinton et al. [HS06] first proposed the concept of deep learning in an
article published in Science. The main idea of their paper is to simulate the
learning process of the human brain by designing a deep neural network. The
features of the target are passed from low levels to high levels of the network;
the higher the level, the more abstract the features. The final output of the
network is the most accurate feature representation of the target. With the
development of various neural network frameworks, more and more target detection
algorithms have been proposed. For example, R-CNN first introduced deep learning
into target detection. This line of work eventually unified all the steps of
target detection under the deep learning framework, so that all computation can be
performed on the GPU and both accuracy and speed are greatly improved. Fast R-CNN
[Gir15] employs several innovations to classify object proposals efficiently and
to improve training and testing speed while also increasing detection accuracy. To
overcome the low speed of selective search on the CPU, Faster R-CNN [RGGS15]
introduces a region proposal network (RPN) to replace the selective search
algorithm for candidate box selection, which makes the entire recognition pipeline
truly end-to-end. It unifies all tasks under the deep learning framework and
greatly improves both speed and accuracy. Although this algorithm outperforms many
state-of-the-art methods, it is still time-consuming because of its complex
framework. Overall, none of the current algorithms achieves both sufficient
accuracy and sufficient efficiency in pedestrian detection.


1.3 The Research Work and Article Structure 


1.3.1 Our Solution and Challenges 

In this paper, our main goal is to build a more robust and faster multi-view
based 3D reconstruction system suited to future intelligent applications. Building
on the investigation and discussion of the defects of current state-of-the-art 3D
reconstruction methods, we first conduct in-depth research on how to improve the
performance of the core algorithm, stereo matching. Our primary goal is to develop
a novel stereo matching algorithm that provides both accurate and efficient
multi-view depth estimation. We then develop and integrate various effective image
processing and detection algorithms that assist stereo matching, in order to
achieve higher-quality multi-view based 3D reconstruction at higher computation
speed than recent state-of-the-art algorithms.

As shown in the flowchart of our 3D reconstruction algorithm (see Figure 1.5),
the proposed multi-view based 3D reconstruction can be regarded as a complex
system consisting of multiple image processing and detection modules. The input
data fall into two categories. The first is a binocular image pair or a multi-view
image sequence, which is used for three-dimensional reconstruction. The second
consists of existing pedestrian video data sets, which are used to train the
pedestrian detection network. The output is a sequence of depth images for the
various views. Module 1 is an image deblurring algorithm that serves as our
pre-processing module; it is used to eliminate noise and improve the accuracy of
the input data. Module 2 is a pedestrian detection algorithm that provides a priori
knowledge to our proposed stereo matching algorithm; this knowledge assists depth
estimation by eliminating the influence of moving pedestrians in the real scene and
avoiding extra computation in unnecessary areas. Module 3 is our core algorithm,
which provides accurate and efficient depth estimation for 3D modeling. Module 4 is
the depth map up-sampling algorithm that serves as our post-processing module; it
further improves the accuracy and visual quality of the output data.

Figure 1.5 The Flowchart of Our Research Work


To explain the role of each stage and the collaboration between the modules more
intuitively, we provide a diagram of intermediate results that further illustrates
the structure of the proposed algorithm. As shown in Figure 1.6, we take a
realistic multi-view image sequence containing pedestrians as the input example.
Note that our pedestrian detection algorithm has been trained on the Caltech
Pedestrian data set and can accurately locate the pedestrian areas. Our goal is to
reconstruct the depth image of the face sculpture scene at different viewpoints. In
our algorithm, the proposed image deblurring algorithm first restores the
sharpness of the input images. Then, based on the restored high-quality images, the
proposed pedestrian detector marks the location and size of each pedestrian area
and passes this a priori information to stereo matching, where it is used to
constrain the depth estimation area. Based on this prior information, the proposed
non-local stereo matching avoids interference from moving pedestrians and
efficiently generates high-accuracy depth images for each view. Finally, the
proposed depth interpolation algorithm further reduces the noise in each depth
result. As the figure shows, our depth images accurately describe the spatial
geometry of the color images.

Figure 1.6 The Intermediate Results of Our Research Work


The challenges related to the above research are as follows:

1. Deblurring algorithm requires high quality image recovery at low
computational cost

Estimating the blur kernel from the entire image has excessive time complexity,
whereas the restoration quality of methods based on a selected patch is too low. At
present, no existing blind deblurring algorithm achieves both high-precision and
fast image deblurring, especially for large images.

2. Stereo matching algorithm requires sub-pixel accuracy depth estimation
and a balance between accuracy and efficiency

Most current stereo matching algorithms provide only one-pixel precision in order
to reduce the disparity label search space, which significantly reduces the depth
detail of the 3D models. Meanwhile, no current global or local stereo matching
algorithm delivers both matching accuracy and computational efficiency, especially
for 3D reconstruction from large images, so many stereo-matching-based
applications cannot achieve their desired performance.

3. Depth map interpolation requires eliminating edge blurring and texture
copying artifacts

Mismatches often occur between depth images and color images, which lead to edge
blurring and texture copying in existing depth map up-sampling algorithms. This
problem arises especially easily in smooth areas of an image.

4. Pedestrian detection requires low false positives and real-time processing

At present, no pedestrian detection algorithm based on hand-crafted features or
deep learning models can accurately handle multi-scale pedestrians (especially
small-scale ones) and occlusion. Moreover, the structural complexity of existing
convolutional neural network based algorithms is too high, so their computational
efficiency cannot meet the requirements of practical applications.

1.3.2 Research Content and Article Structure 

The research content of this article is mainly reflected in the following aspects: 

1. Accurate Blind Deblurring Using SalientPatch-Based Prior for Large-Size 
Images 

We propose a SalientPatch-based blur kernel estimation algorithm that computes
objectness probability, structural richness, and local contrast. Our method
significantly improves the speed and quality of deblurring for large images.

2. Hybrid Tree Guided PatchMatch and Quantizing Acceleration for Multiple
Views Disparity Estimation

We generate an initial disparity map by using hybrid tree cost aggregation to
constrain the label search range. We then provide a novel plane refinement strategy
for PatchMatch stereo to compute the final disparity accurately. For the matching
cost computation, an effective quantizing acceleration strategy is also proposed to
speed up the computation, achieving a balance between estimation accuracy and
computational efficiency. In addition, the algorithm is designed to suit multi-view
reconstruction of real scenes.

3. Joint Depth Map Interpolation and Segmentation with Planar Surface 
Model 

Based on the complementarity of image segmentation and interpolation, we propose a
depth map up-sampling algorithm built on a specially defined joint interpolation
and segmentation model. We use image segmentation to solve the problem of edge
blurring. At the same time, we fit the depth map with multiple planes to estimate
the missing depth information and solve the problem of texture copying.

4. Real-Time Pedestrian Detection via Hierarchical Convolutional Feature

We propose a new fully connected convolutional neural network for pedestrian
detection, in which feature maps of different layers are assigned to detect
pedestrians of different scales to achieve high-precision detection of small-scale
pedestrians. At the same time, we propose a new prediction box that closely fits
the shape of pedestrians to improve the speed of pedestrian detection and handle
occlusion.

The structure of this article is as follows:

Chapter 1 is the introduction. It introduces the research background, related
works, challenges, research content, and article structure, and gives a general
description of the basic principles and methods related to our research.

Chapter 2 covers image deblurring, the pre-processing step of our 3D
reconstruction algorithm. In this chapter, we analyze the current problems of
blind image deblurring in depth and present our solution, which includes the novel
concept of the "SalientPatch" and the generation of the region of interest for
kernel estimation, among others.

Chapter 3 covers pedestrian detection, which provides a priori information for
stereo matching. In this chapter, a novel convolutional network is trained for
pedestrian detection. The work includes a fully convolutional network and a novel
prediction box with a single specific aspect ratio, among others.

Chapter 4 covers depth estimation, the core of our 3D reconstruction algorithm.
In this chapter, we summarize the bottlenecks of state-of-the-art stereo matching
algorithms and give our solution, which includes a hybrid tree cost aggregation, a
novel plane refinement strategy for PatchMatch stereo, and an effective quantizing
acceleration strategy.

Chapter 5 covers depth interpolation, the post-processing step of our 3D
reconstruction algorithm. In this chapter, we briefly analyze segmentation and
interpolation algorithms and identify the relationship between them. The work
includes a joint depth map interpolation and segmentation method for depth
up-sampling.

Chapter 6 is the conclusion. It summarizes the achievements of our research and
proposes directions for further research.


Appendix A gives the derivation of the multi-view epipolar geometry formulas.

Chapter 2
Image Deblurring as Pre-processing


2.1 Introduction 


Most blind deblurring algorithms adopt a full-image kernel estimation strategy,
which is often too sensitive to smooth and fine-scale background regions. This
strategy easily leads to erroneous estimates and is usually inefficient for large
images, so estimating the blur kernel from the complete image is not a good
strategy. In addition, an important observation about human perception [ADF10] is
that people first focus on the salient foreground area when looking at an
unfamiliar scene. We therefore argue that the criterion for judging whether a blind
deblurring algorithm is effective should be whether the foreground object in the
blurred image is well deblurred. The most important goal of deblurring is to
restore the area of interest to the human eye rather than the background.
Consequently, blur kernel estimation should be based on a carefully selected local
region. However, existing region-based methods enumerate all possible local regions
and select the best one, which is impractical because it consumes a great deal of
time.

We designed the blur kernel similarity heat map in Figure 2.1 to demonstrate our
findings. We first compute the similarity between the groundtruth kernel and each
kernel candidate estimated from a sliding window of size 300 x 300. Each pixel
value of the heat map is then obtained by averaging the kernel similarity values of
the sliding windows that cover it. In this heat map, the shade of the color
indicates the similarity level. A region of high kernel similarity (a yellower
region) indicates that kernel estimation in this area yields high-quality recovery.
A region of low kernel similarity (a bluer region) is a poor area containing smooth
shadow or background, which is not suitable for blur estimation. Comparing the
color image with the heat map, the object of interest in the foreground coincides
with the yellow region. This means that the image area containing the object has
rich key features, which helps improve the accuracy of blur kernel estimation.


[Figure 2.1 panels: Blurry Image; Ground Truth Kernel; Kernel Similarity Map]

Figure 2.1 Kernel Similarity Heat Maps

Notes: Kernel similarity heat maps built from the kernels estimated on sliding sub-windows.
Pixels from blue to yellow indicate low to high similarity with the groundtruth kernel.
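
The construction of such a heat map can be sketched as follows, under the assumption that each pixel receives the average score of all sliding windows that cover it. The callback `score_fn` is a hypothetical stand-in for the expensive step (estimating a kernel on the window and comparing it with the groundtruth kernel), and the stride parameter is likewise an assumption, since the text does not state one.

```python
import numpy as np

def window_heat_map(shape, window, stride, score_fn):
    """Per-pixel average of the scores of all sliding windows covering the pixel.

    shape:    (height, width) of the image
    window:   side length of the square sliding window
    stride:   step between consecutive window positions
    score_fn: hypothetical callback (y, x) -> similarity score of the window
              whose top-left corner is at (y, x)
    """
    h, w = shape
    total = np.zeros(shape, dtype=np.float64)
    count = np.zeros(shape, dtype=np.float64)
    for y in range(0, h - window + 1, stride):
        for x in range(0, w - window + 1, stride):
            score = score_fn(y, x)
            total[y:y + window, x:x + window] += score
            count[y:y + window, x:x + window] += 1
    # Pixels not covered by any window keep the value 0.
    return np.divide(total, count, out=np.zeros(shape), where=count > 0)

# Toy usage: only the top-left window scores 1, all others score 0.
heat = window_heat_map((3, 3), window=2, stride=1,
                       score_fn=lambda y, x: 1.0 if (y, x) == (0, 0) else 0.0)
print(heat)
```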


Based on the heat map analysis, we propose a new concept called SalientPatch.
It is a carefully selected area of the input image that combines high kernel
similarity with visual saliency. SalientPatch is designed to support a variety of
blur kernel estimation algorithms, especially for large blurred images. We first
create three image layers with a multi-level segmentation algorithm. An interest
map based on three related cues (object probability, structural richness, and
local contrast) is then computed for each layer. Next, we fuse the three layers to
generate a final interest map. Finally, we locate the SalientPatch region based on
this map. Although these cues are taken from existing state-of-the-art work,
integrating them to locate patches for kernel estimation is new. As shown in
Figure 2.2, the SalientPatch is first positioned automatically based on the
obtained interest map. The blur kernel is then estimated over the selected region.


(a) Located SalientPatch (b) Restored Image 


Figure 2.2 Located SalientPatch and Restored Image 


Notes: SalientPatch locating result and corresponding restored image. 
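
The last two steps described above (fusing the per-layer interest maps and locating a patch on the fused map) can be sketched as follows. The text does not state the fusion rule or the selection criterion, so this sketch assumes a plain average of the layers and picks the fixed-size window with the largest total interest; both choices, and the function names, are assumptions rather than the thesis method.

```python
import numpy as np

def fuse_interest_maps(layers):
    """Fuse per-layer interest maps by a plain average (an assumed rule)."""
    return np.mean(np.stack(layers), axis=0)

def locate_patch(interest, patch_h, patch_w):
    """Top-left corner (y, x) of the window with the largest summed interest."""
    # Integral image with a leading row and column of zeros.
    ii = np.zeros((interest.shape[0] + 1, interest.shape[1] + 1))
    ii[1:, 1:] = interest.cumsum(axis=0).cumsum(axis=1)
    sums = (ii[patch_h:, patch_w:] - ii[:-patch_h, patch_w:]
            - ii[patch_h:, :-patch_w] + ii[:-patch_h, :-patch_w])
    y, x = np.unravel_index(np.argmax(sums), sums.shape)
    return int(y), int(x)

# Toy usage: three layers that agree on a small high-interest block.
layer = np.zeros((8, 10))
layer[3:5, 6:8] = 1.0
fused = fuse_interest_maps([layer, layer, layer])
print(locate_patch(fused, 2, 2))
```

The integral image keeps the search linear in the number of pixels, which matters for the large images targeted here.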


Finally, the blurred image is restored with different kernel estimation
algorithms. One important contribution of this algorithm is a new kernel estimation
method based on SalientPatch, which can be applied to different types of maximum a
posteriori (MAP) frameworks [KTF11] [KSM13] [PSPY16]. In our blind deblurring
algorithm, kernel estimation is restricted to the SalientPatch we define. In this
way, the estimated kernel can closely approach the groundtruth kernel and is tied
to the areas that attract human vision. The experimental results show that the
proposed algorithm significantly improves the speed of kernel estimation,
especially for large blurred images, while also guaranteeing high-quality image
restoration.


2.2 Related Works


Image deblurring algorithms automatically recover sharp images from degraded ones
through a series of image processing steps. At present, they fall into two
categories: non-blind deblurring and blind deblurring. A non-blind deblurring
algorithm requires a known blur kernel; given this kernel, it can recover the
corresponding sharp image directly from the blurred image. Blind deblurring
restores degraded images when the blur kernel is unknown, so it must first estimate
the blur kernel and then use a non-blind algorithm to recover the sharp image.
Blind deblurring is clearly better suited to practical applications. Most
traditional blind deblurring methods focus on finding suitable natural image priors
or additional image observations to recover the image. For example, algorithms
based on maximum a posteriori (MAP) estimation [CW98] ([KTF11] [KSM13] [PSPY16])
generally assume that natural images have Gaussian features and that the blurring
process makes an image deviate from natural image features. Kotera et al. [KSM13]
used a heavy-tailed prior to describe the image and then restored it with the
augmented Lagrangian method. Although this algorithm outperforms many earlier
methods, its restoration is accurate only for natural scene images and not for
other types of images (e.g., face, low-illumination, and text images). Krishnan et
al. [KTF11] proposed a novel normalized sparsity prior that assigns sharp natural
images lower energy than blurred images, and then used the iterative
shrinkage-thresholding algorithm to optimize the latent image. The algorithm
significantly improves the speed of the traditional MAP algorithm, but it still
cannot provide high-precision deblurring because of its poor description of image
brightness and gradients. To significantly improve the accuracy of image
deblurring, Pan et al. [PSPY16] proposed an optimization method based on a novel
$L_0$-regularized dark channel prior. This algorithm achieves high-quality
deblurring on different types of images, but its optimization is very complicated,
which makes it very time consuming, especially for large images. Note that the
$L_0$ norm is highly non-convex.

Recent studies have found that image structure plays a decisive role in estimating
blur kernels. To speed up blind deblurring, the blur kernel can be estimated in a
selected good region. Bae et al. [BFC12] used mosaic image patches found in the
original blurry image to recover blur kernels. This algorithm significantly
improves estimation efficiency, especially for large images. However, its selection
of the kernel estimation region considers only the edge richness cue and ignores
other factors (such as object probability and local contrast), which reduces the
accuracy of the blur kernel estimation. Hu and Yang [HY15] proposed to learn good
regions for deblurring within a conditional random field framework. However, this
kind of learning-based method adds to the complexity of the algorithm. Although it
can improve the accuracy of the blur kernel, its efficiency is significantly
reduced, which makes it unsuitable for practical applications.

Research on saliency detection has a long history. The technology is now mature, 
and a large body of excellent related work exists. The goal of saliency detection 
is to automatically mark the image areas that humans care most about, which often 
carry valuable information. In addition, current co-saliency detection algorithms 
can locate the same foreground object in a set of images that share common 
characteristics. For example, [ZWL17] [ZLL17] [ZHHS16] [SZL17] use a neural network 
or a low-rank representation to separate complex backgrounds while detecting 
foreground objects. However, the detection cues used in most of this research are 
designed for clear scene images and cannot tolerate camera shake or blur. 
Moreover, these cues have not been used as an image prior for blur kernel 
estimation in blind deblurring. In addition, neural network-based methods tend to 
limit the size of the input image during training. Images that do not meet the 
size requirement must be scaled and cropped, which can lose a lot of useful 
information and reduce usability. Such methods are therefore not suitable for 
blind image deblurring tasks. 


In summary, full-image blur kernel estimation requires a large amount of 
computation, and its efficiency is unacceptable, especially for the recovery of 
large images. Although patch-based algorithms are relatively faster, they sacrifice 
recovery accuracy. At present, few blind deblurring algorithms provide both high 
precision and fast computation for large-image restoration. To solve this 
bottleneck, we define a novel patch-based kernel estimation method. Based on 
careful selection of the kernel estimation region, our proposed algorithm achieves 
the same or even higher estimation accuracy than full-image blur kernel estimation 
while remaining computationally efficient. 


2.3 Proposed Method 


2.3.1 Overview 

Our algorithm mainly targets large blurred images as input. It first computes 
the center position of the SalientPatch. The blur kernel is then estimated within 
the SalientPatch and finally used to restore the degraded image. The detailed 
process is shown in Figure 2.3, in which the green square region is our proposed 
SalientPatch. 

Figure 2.3. The Flowchart of Our SalientPatch 
Notes: Our SalientPatch-based deblurring process. The green dashed square region in the 
threshold binary map is the located SalientPatch, and the two orange dashed rectangles contain our 
innovative work. 

2.3.2 SalientPatch Definition and Locating 

We define the SalientPatch as a square image region that contains the most 
interesting information in the image, or a local part of the blurred image with 
rich detail. As previously analyzed, not all pixels in a blurred image improve the 
accuracy of blur kernel estimation. In a blurred image, only a few foreground 
regions with useful structural information effectively improve the accuracy of 
kernel estimation. Therefore, our proposed deblurring algorithm limits the 
estimation range to the SalientPatch to ensure both the accuracy and the 
efficiency of blind deblurring. To locate the patch fully automatically, and to 
prevent large images (more than 800x600 in this paper) from exhausting the 
resources of a pixel-level algorithm, we compute the image features over small 
sub-windows, following the method proposed in [HY15]. To generate a suitable 
interest image, a multi-level segmentation algorithm is first applied to create 
multiple image layers [YXSJ13]. Next, each layer is analyzed to generate its 
corresponding layer interest graph. All resulting layer graphs are then linearly 
merged into a refined interest image, which is finally binarized to quickly locate 
our SalientPatch. 

1. Multi-level Segmentation 

Before generating the interest graph, we first use the Sticky Edge Adhesive 
Superpixel Detector [DZ13, DZ14, ZD14] to decompose the blurred image into 
compact superpixels. We chose this detector because it produces a multi-layered 
superpixel image that improves the accuracy of saliency detection, and because 
its superpixel maps are better than those of other strong over-segmentation 
methods. The specific advantages of the algorithm include: 

(1) Its edge detection strategy, based on the internal structure of local 
image blocks, provides technical support for our patch selection method. 

(2) Since the algorithm fully considers structural information, the generated 
superpixels adhere closely to important edges. 

(3) It uses a random decision forest framework to learn local edge masks for the 
segmentation, which gives it a higher running speed. 


Chapter 2. Image Deblurring as Pre-processing 


In our experiments, we found that the superpixels generated directly by this 
algorithm are still too numerous and too small, which is not suitable for interest 
image extraction. Therefore, we further merge the generated superpixels to increase 
the efficiency and precision of the extraction. For the merging, we generate a 
multi-level clustering map using the density-based spatial clustering of 
applications with noise (DBSCAN) method. Superpixels with close feature distances 
in the CIELAB color space are naturally clustered together, while smaller edges 
are fully preserved (see Layer 1, Layer 2 and Layer 3 in Figure 2.3). In the end, 
our algorithm generates three levels of superpixel maps. 
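
The merging step can be illustrated with a minimal sketch. It assumes that each 
superpixel is summarized by its mean CIELAB color and that the three layers come 
from three increasingly large DBSCAN radii. The radii, the sample colors, 
`min_samples=1` and the function name `dbscan` are illustrative assumptions rather 
than the settings used in this work, and spatial adjacency between superpixels is 
ignored for brevity. 

```python
from math import dist


def dbscan(points, eps, min_samples=1):
    """Minimal DBSCAN: returns one cluster label per point (-1 marks noise)."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(len(points))
                     if dist(points[i], points[j]) <= eps]
        if len(neighbors) < min_samples:
            labels[i] = -1  # provisional noise, may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = [k for k in range(len(points))
                     if dist(points[j], points[k]) <= eps]
            if len(reach) >= min_samples:
                queue.extend(reach)  # j is a core point, keep expanding
    return labels


# Hypothetical mean CIELAB colors (L, a, b) of six superpixels.
mean_lab = [(52.0, 10.0, 8.0), (53.0, 11.0, 9.0), (54.5, 12.0, 9.5),
            (80.0, -5.0, 30.0), (81.0, -4.0, 31.0), (20.0, 0.0, 0.0)]

# A larger radius merges more superpixels, giving a coarser layer.
layers = [dbscan(mean_lab, eps) for eps in (1.8, 2.0, 40.0)]
```

With these toy colors the first radius keeps four groups, the second merges the 
three similar superpixels into one group, and the third merges everything, which 
mirrors the fine-to-coarse behavior of the three layers. 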

2. Interest Image Generation 

We define the interest image as a grayscale map in which the gray level of a 
pixel represents its saliency value (white to black indicates saliency from high 
to low). We use multiple superpixel-related cues to generate an interest graph for 
each layer. Existing interest image generation algorithms focus primarily on 
intensity or color, typically described by the contrast between adjacent image 
regions. Most of these algorithms can only be applied to clear images. Since our 
input is a blurred image, the existing methods cannot be used directly. Therefore, 
we integrate three additional related cues (objectness probability, structure 
richness and local contrast) to generate interest graphs at multiple levels, which 
addresses the problem that object boundaries are not obvious in blurred images. 

① Objectness Probability 

Alexe et al. [ADF12] first proposed the concept of objectness probability, which 
measures the probability that each candidate box from a sliding window contains 
a target foreground object. They assume that a foreground object has two 
characteristics: it has a complete boundary, and it contrasts sharply with the 
surrounding background. In their algorithm, whether a candidate box contains a 
target is measured by computing a score. 

In our algorithm, however, the objectness cue needs to measure the probability 
that each superpixel belongs to the interesting target. To achieve this goal, we 
first use the method proposed in [ZLL+17] to obtain a pixel-level objectness map 
$OP_p$ with the same size as the input image: 

$$OP_p(x) = \sum_{b \in B(x)} P(b) \qquad (2.1)$$

where $P(b)$ is the probability of a single candidate box $b$, and $B(x)$ 
represents all patches that contain pixel $x$. In order not to affect the 
processing efficiency, we limit the total number of patches to less than 50. Based 
on the pixel-level probability, we can easily obtain the superpixel-level 
objectness probability $OP_s$ as follows: 

$$OP_s(x) = \frac{1}{\mathrm{numel}(S)} \sum_{y \in S} OP_p(y) \qquad (2.2)$$

where $S$ represents the current superpixel and $\mathrm{numel}(S)$ represents the 
total number of pixels the superpixel contains. 
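
Equations (2.1) and (2.2) can be sketched as follows. The sketch assumes that 
candidate boxes are given as corner-inclusive tuples `(x0, y0, x1, y1, prob)` and 
that the superpixel map is a 2-D list of integer labels; both data layouts and the 
function names are hypothetical. 

```python
def pixel_objectness(height, width, boxes):
    """Eq. (2.1): sum the probability of every candidate box covering a pixel."""
    op = [[0.0] * width for _ in range(height)]
    for x0, y0, x1, y1, prob in boxes:
        for y in range(max(y0, 0), min(y1, height - 1) + 1):
            for x in range(max(x0, 0), min(x1, width - 1) + 1):
                op[y][x] += prob
    return op


def superpixel_objectness(op, labels):
    """Eq. (2.2): average the pixel-level objectness inside each superpixel."""
    total, count = {}, {}
    for row_op, row_lab in zip(op, labels):
        for value, lab in zip(row_op, row_lab):
            total[lab] = total.get(lab, 0.0) + value
            count[lab] = count.get(lab, 0) + 1
    return {lab: total[lab] / count[lab] for lab in total}


# Toy 2x4 image with two candidate boxes and two superpixels (labels 0 and 1).
boxes = [(0, 0, 1, 1, 0.5), (1, 0, 2, 1, 0.25)]
labels = [[0, 0, 1, 1],
          [0, 0, 1, 1]]
op = pixel_objectness(2, 4, boxes)
scores = superpixel_objectness(op, labels)
```

In this toy case the left superpixel is covered by both boxes and receives the 
higher score, so it would be treated as more likely to belong to the target. 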

② Structure Richness 

Cho and Lee [CL09] found that the background area of an image tends to be 
relatively smooth or to have only very small-scale edges. If the region used for 
blur kernel estimation contains too much background information, the deblurred 
image will show significant ringing artifacts. A smooth region remains smooth 
after blurring, so background information can seriously reduce the accuracy of 
the kernel estimation. In contrast, the rich edge structure inside and on the 
boundary of an interesting foreground area contains important information for 
blur kernel estimation. Therefore, we measure the amount of structure of each 
superpixel in each layer when generating the interest image. 

Since the superpixels generated by the Sticky Edge Adhesive Superpixel Detector 
adhere closely to the edges of the image, both the internal pixels and the 
boundary pixels of a superpixel must be considered when calculating the 
structure richness. We define the structure richness as the average of the 
Euclidean norms of the gradient of the main structure within each superpixel: 

$$SR_s(x) = \frac{1}{\mathrm{numel}(S)} \sum_{y \in S} \left\| \nabla I_s(y) \right\|_2 \qquad (2.3)$$

where $I_s$ is the main structure of the image, which can be efficiently detected 
by the random forest framework [DZ13]. 
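
A minimal sketch of Equation (2.3) follows. It assumes the main-structure map is 
already available as a 2-D list of floats and approximates the gradient with 
forward differences; the difference scheme, the toy data and the function name 
are assumptions made only for illustration. 

```python
from math import hypot


def structure_richness(structure, labels):
    """Mean gradient magnitude of the structure map inside each superpixel."""
    h, w = len(structure), len(structure[0])
    total, count = {}, {}
    for y in range(h):
        for x in range(w):
            # Forward differences; the last row and column use a zero difference.
            gx = structure[y][x + 1] - structure[y][x] if x + 1 < w else 0.0
            gy = structure[y + 1][x] - structure[y][x] if y + 1 < h else 0.0
            lab = labels[y][x]
            total[lab] = total.get(lab, 0.0) + hypot(gx, gy)
            count[lab] = count.get(lab, 0) + 1
    return {lab: total[lab] / count[lab] for lab in total}


# Superpixel 0 is flat (background-like); superpixel 1 contains a strong edge.
structure = [[1.0, 1.0, 1.0, 5.0],
             [1.0, 1.0, 1.0, 1.0]]
labels = [[0, 0, 1, 1],
          [0, 0, 1, 1]]
richness = structure_richness(structure, labels)
```

The flat superpixel gets a richness of zero while the one containing the edge 
gets a positive value, which is the behavior the cue relies on to down-weight 
smooth background. 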

③ Local Contrast 

The local contrast cue describes the degree of color difference between 
superpixels by calculating the color histogram distance between the current 
superpixel and its neighbors. If the color of a superpixel differs from the 
surrounding background color, its local contrast value is higher. According to 
the previous analysis, the color of the interesting area is usually quite 
different from the background color. Based on this feature, we calculate the 
local contrast using an algorithm similar to Tong et al. [TLZX14]: 

$$LC_s(c_i) = \sum_{j=1}^{N_i} \omega_{ij}\, h\big(d(c_i, c_j)\big) \times g(x, y) \times q(u) \qquad (2.4)$$

where $c_i$ is the current superpixel, $N_i$ is the total number of its 
neighboring superpixels, and $\omega_{ij}$ is the ratio of the area of superpixel 
$c_j$ to the total area of the neighborhood $\{c_j\}$. The function $d(c_i, c_j)$ 
is the histogram distance. The function $h(\theta) = -\log(1 - \theta)$ keeps the 
output positive. The function $g(x, y)$ calculates the normalized spatial 
distance from the centroid of the superpixel $c_i$ to the center of the image 
$(x_0, y_0)$, and in $q(u)$, $u$ is the number of pixels on the boundary of the 
image. 
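
The contrast part of Equation (2.4) can be sketched as below. This is only a 
partial illustration: it covers the area-weighted sum of h(d(c_i, c_j)) and leaves 
out the two spatial factors. The chi-square histogram distance, the clipping 
constant and the function names are assumptions, since the text does not specify 
which histogram distance is used. 

```python
from math import log


def chi_square(p, q):
    """Chi-square distance between two normalized histograms, in [0, 1]."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)


def local_contrast(hist_i, neighbors):
    """neighbors: list of (histogram, area) pairs of the adjacent superpixels."""
    total_area = sum(area for _, area in neighbors)
    score = 0.0
    for hist_j, area in neighbors:
        weight = area / total_area  # area ratio of neighbor j
        # Clip the distance just below 1 so that the logarithm stays finite.
        theta = min(chi_square(hist_i, hist_j), 1.0 - 1e-6)
        score += weight * -log(1.0 - theta)  # h(theta) = -log(1 - theta)
    return score


# Two-bin color histograms of a superpixel and of three possible neighbors.
current = [1.0, 0.0]
same = local_contrast(current, [([1.0, 0.0], 10)])
mild = local_contrast(current, [([0.5, 0.5], 10)])
strong = local_contrast(current, [([0.0, 1.0], 10)])
```

A neighbor with an identical histogram contributes no contrast, and the score 
grows as the neighbor's colors move away from those of the current superpixel. 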

A large number of experiments have confirmed that an interest image based only 
on the objectness cue suppresses the background too strongly, an interest image 
based only on the structure richness cue cannot distinguish foreground from 
background well, and an interest image based only on the local contrast cue is too 
scattered for our SalientPatch selection. No single cue therefore provides 
accurate region selection for blur kernel estimation. As shown in Figure 2.4, the 
three cues are complementary rather than contradictory, so we integrate them to 
obtain an interest image using the following equation: 
$S = \exp(SR_s + LC_s) \times OP_s$, where $\exp(\cdot)$ is the exponential 
function. All operations in the equation are performed on superpixels. Our 
SalientPatch selection strategy makes good use of the characteristics of each cue 
and lets them complement each other. The resulting interest image effectively 
improves the estimation of blur kernels. 
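
The fusion rule can be written as a few lines of code. The sketch assumes each 
cue is stored as a dictionary from superpixel label to score; the final division 
by the maximum, which maps the result to a [0, 1] gray level, is an added 
assumption and not part of the equation above. 

```python
from math import exp


def fuse_cues(op, sr, lc):
    """S = exp(SR + LC) * OP, evaluated independently for every superpixel."""
    return {lab: exp(sr[lab] + lc[lab]) * op[lab] for lab in op}


def normalize(scores):
    """Scale scores to [0, 1] so they can be shown as gray levels (assumption)."""
    peak = max(scores.values())
    return {lab: (value / peak if peak > 0 else 0.0)
            for lab, value in scores.items()}


# Hypothetical cue values for three superpixels.
op = {0: 0.8, 1: 0.8, 2: 0.0}  # objectness probability
sr = {0: 1.0, 1: 0.2, 2: 1.5}  # structure richness
lc = {0: 0.5, 1: 0.1, 2: 2.0}  # local contrast
interest = normalize(fuse_cues(op, sr, lc))
```

Because the objectness term multiplies the exponential, superpixel 2 is 
suppressed to zero despite its high richness and contrast, while among equally 
object-like superpixels the richer one receives the higher interest value. 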


Figure 2.4 Comparison among Different Cues and Integrating Cues 

Notes: 
(a) The Objectness Probability suppresses the background too much; 
(b) The Structure Richness cannot distinguish the foreground well from the image; 
(c) The Local Contrast may spread too much for the patch selection; 
(d) The integration of the three cues achieves completeness and improves the accuracy of 
SalientPatch locating. 


Strong edges in the image improve the accuracy of blur kernel estimation, so we 
should choose regions with more structure richness to pursue high-quality 
recovery. However, for some extremely blurred images with unclear structures, 
choosing a suitable SalientPatch is difficult. To further demonstrate the 
advantages of combining the three cues, we tested our algorithm on a more heavily 
blurred image. As shown in Figure 2.5, the structure richness cue alone cannot 
correctly select the SalientPatch for the input image shown in sub-figure (a), 
but the other two cues (local contrast and objectness probability) compensate 
for it. Sub-figure (b) shows the estimated kernel and the recovered image based 
on our selected SalientPatch. Our proposed SalientPatch restores heavily blurred 
images well, which shows that fusing the three correlated cues to select the 
patch is beneficial for deblurring. 


Figure 2.5 Another Deblurring Result for More Blurred Image 

Notes: (a) Blurry Input; (b) Our SalientPatch and Deblurred Result. 
The structure richness cue is no longer sufficient for SalientPatch selection in the input 
image shown in sub-figure (a), while the other two cues, local contrast and objectness 
probability, compensate for it. Sub-figure (b) shows the estimated kernel 
and the restored image based on our selected SalientPatch. 


3. SalientPatch Location 

Because the image layers differ, the finest layers may contain small-scale 
structures, while the coarsest layers contain large-scale structures. The 
information of a single layer therefore does not represent saliency correctly. 
We fuse the multi-layer interest graphs into a final grayscale image that 
represents the suitability of image regions for kernel estimation at different 
scales, and we locate the SalientPatch based on this final interest graph. As 
can be seen from Figure 2.6, the result of the coarsest layer is the worst 
because its large structures only separate the foreground from the background. 
Although the results of the two finer layers are better, they have some problems 
with the scale partition. To solve this problem, similar to the work of Yan et 
al. [YXSJ13], we compute a weighted average of all single-layer results to obtain 
the final interest image, where the weight of the finest layer is larger than 
that of the coarser layers. 

Figure 2.6 The Comparison of The Averaged Layer and Three Single Layers 

Notes: None of the three single layers from fine to coarse (a)~(c) is satisfactory, while the 
averaged layer (d) performs better for our needs. 


We assign weights of 0.4, 0.3 and 0.3 to the layers for linear fusion. Extensive 
experiments have shown that this combination produces a satisfactory interest 
image. Based on the obtained interest image, we take the centroid of the largest 
connected region in the binarized interest image as the center of the 
SalientPatch and then set an appropriate size. The study in [HY15] found that 
the choice of sub-window size is not very restrictive and does not affect the 
accuracy of kernel estimation. In our algorithm, we empirically set the size of 
the blur kernel to 35x35 and the size of the SalientPatch to 400x400. 
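
The locating step described above (linear fusion, binarization, largest 
connected region, centroid, fixed-size patch) can be sketched as follows. The 
binarization threshold of 0.5, the 4-connectivity, the assignment of the 0.4 
weight to the finest layer and all function names are assumptions, since the 
text does not fix them; the toy maps are far smaller than a real image. 

```python
from collections import deque


def fuse_layers(layers, weights=(0.4, 0.3, 0.3)):
    """Weighted average of the per-layer interest maps (finest layer first)."""
    h, w = len(layers[0]), len(layers[0][0])
    return [[sum(wt * layer[y][x] for wt, layer in zip(weights, layers))
             for x in range(w)] for y in range(h)]


def largest_region_centroid(interest, threshold=0.5):
    """Binarize, find the largest 4-connected region and return its centroid."""
    h, w = len(interest), len(interest[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for y in range(h):
        for x in range(w):
            if seen[y][x] or interest[y][x] < threshold:
                continue
            region, queue = [], deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                region.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w and not seen[ny][nx]
                            and interest[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) > len(best):
                best = region
    if not best:
        return None
    return (sum(p[0] for p in best) / len(best),
            sum(p[1] for p in best) / len(best))


def patch_box(center, size, height, width):
    """Top-left corner of a size x size patch around center, kept inside the image."""
    top = min(max(int(round(center[0] - (size - 1) / 2)), 0), max(height - size, 0))
    left = min(max(int(round(center[1] - (size - 1) / 2)), 0), max(width - size, 0))
    return top, left


fine = [[0, 0, 0, 0, 0], [0, 1, 1, 0, 0], [0, 1, 1, 0, 1], [0, 0, 0, 0, 0]]
mid = [[0, 0, 0, 0, 0], [0, 1, 1, 0, 0], [0, 1, 1, 0, 1], [0, 0, 0, 0, 0]]
coarse = [[0, 0, 0, 0, 0], [0, 1, 1, 0, 0], [0, 1, 1, 0, 0], [0, 0, 0, 0, 0]]

fused = fuse_layers([fine, mid, coarse])
centroid = largest_region_centroid(fused)
box = patch_box(centroid, 2, len(fused), len(fused[0]))
```

In the toy example the isolated bright pixel on the right forms its own small 
region and is ignored, so the patch is centered on the 2x2 block. With the sizes 
used in this chapter, `size` would be 400. 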


2.3.3 Blur Kernel Estimation and Deblurring 

Experiments show that our algorithm is general and can easily be combined with 
most current blind deblurring frameworks. We only need to replace the entire 
image with the automatically selected SalientPatch as the input of kernel 
estimation, which significantly speeds up most blind deblurring methods. Here we 
accelerate a widely accepted, outstanding deblurring algorithm as an example. Pan 
et al. [PSPY16] proposed a dark channel-based deblurring algorithm that achieves 
high-quality recovery on various types of images. However, the biggest drawback 
of this algorithm is its high computational complexity O(n), where n is the number 
of pixels. Especially for large images, Pan's method does not provide satisfactory 
speed (the blur kernel estimation step may even take more than an hour). This low 
efficiency slows down the overall performance and makes an otherwise good 
algorithm unusable. Besides the SalientPatch selection strategy, we optimize 
three key steps in Pan's algorithm to improve efficiency: 

(1) dark channel computation $D(I) = MI$; 

(2) thresholding the dark channel's auxiliary variable 
$u = \mathrm{thresholding}(u)$; 

(3) reverse operation $i = M^T u$. 

Here $D(I)$, $M$, $I$, $u$, $M^T$, and $i$ denote the dark channel image, the 
dark channel linear operator, the clear image, the auxiliary variable of the dark 
channel image, the reverse dark channel operator and the deblurred image, 
respectively. According to these formulas, the reverse operation has to wait 
until the entire dark channel map has been generated by the sliding-window 
iteration, which is computationally inefficient for large images. To solve this 
problem, we merge these steps into a single step: after calculating the dark 
channel pixel in a sliding window, we directly perform the thresholding and 
reverse operations, which avoids waiting for the entire dark channel map. At the 
same time, we use a median filter to refine the dark channel image, which 
effectively addresses the problem of recovering images with salt-and-pepper 
noise in [PSPY16]. In summary, our algorithm can provide accurate blur kernel 
estimation for images with general noise. Note that our algorithm is based on a 
uniform convolution model, so it is only suitable for removing camera motion blur. 
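
The idea of merging the three steps into one pass can be sketched as below for a 
grayscale image. The hard-thresholding rule (keep a value only if its square is 
at least `tau`), the scatter-add form of the reverse operation and the function 
name are assumptions chosen for illustration; they are not taken from the 
implementation of [PSPY16], and the median-filter refinement is omitted. 

```python
def fused_dark_channel_step(image, radius, tau):
    """Per window: dark-channel minimum, hard threshold, immediate scatter back."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # (1) dark channel: minimum of the window centered at (y, x)
            best, by, bx = None, y, x
            for ny in range(max(y - radius, 0), min(y + radius, h - 1) + 1):
                for nx in range(max(x - radius, 0), min(x + radius, w - 1) + 1):
                    if best is None or image[ny][nx] < best:
                        best, by, bx = image[ny][nx], ny, nx
            # (2) hard thresholding of the auxiliary variable
            u = best if best * best >= tau else 0.0
            # (3) reverse operation: send u back to the pixel that gave the minimum
            out[by][bx] += u
    return out


image = [[1.0, 0.75],
         [0.5, 0.25]]
kept = fused_dark_channel_step(image, radius=1, tau=0.01)
suppressed = fused_dark_channel_step(image, radius=1, tau=0.1)
```

Each window is finished before the next one starts, so no full dark channel map 
has to be stored between the steps. In the toy image all four windows share the 
same minimum, which is either accumulated at its source pixel or removed 
entirely, depending on the threshold. 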


2.4 Experimental Results and Evaluations 


1. Dataset construction. 

Existing common benchmark datasets (such as [GF12] and [LWDF09]) cannot provide 
large images and do not meet the requirements of SalientPatch's interest 
attributes (some images are grayscale). We therefore designed a new dataset, 
constructed by convolution, specifically for evaluating large-image deblurring 
algorithms. We need to compare our SalientPatch-based algorithm with existing 
full-image approaches. Since these full-image methods take about two hours to 
deblur each image, it is not necessary to create a dataset with hundreds of large 
blurred images. Our test set contains 32 large blurred color images. They are 
generated from 8 carefully selected large HD images with resolutions ranging from 
0.7 megapixels to 3.3 megapixels. To generate more test images, we convolve each 
of the 8 images with each of the four blur kernels in [GF12] and [LWDF09]. As 
shown in Table 2.1 and Figure 2.7, each of the four blur kernels corresponds to 8 
blurred images. Our dataset represents most motion blur problems and can be 
extended using Augmentor [BSH17]. All of our tests used a PC with a 64-bit 
Windows 10 operating system, an Intel i7 2.60 GHz CPU, 16 GB of RAM and an 
NVIDIA GeForce GTX 1060 GPU. 
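
The dataset construction (every sharp image convolved with every kernel) can be 
sketched as follows. The tiny grayscale images and kernels are toy stand-ins, the 
replicated border handling is an assumption, and the `name+kernel` keys only 
imitate the row labels used in Tables 2.2 to 2.4. 

```python
def convolve2d(image, kernel):
    """Same-size 2-D convolution of a grayscale image with replicated borders."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    # Convolution reads the image in flipped kernel order.
                    sy = min(max(y + kh // 2 - i, 0), h - 1)
                    sx = min(max(x + kw // 2 - j, 0), w - 1)
                    acc += kernel[i][j] * image[sy][sx]
            out[y][x] = acc
    return out


# Toy stand-ins: two "sharp images" and two normalized "blur kernels".
sharp_images = {
    "riding": [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]],
    "coffee": [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]],
}
kernels = {
    "k13": [[0.5, 0.5, 0.0]],        # horizontal motion
    "k17": [[0.25], [0.5], [0.25]],  # vertical motion
}

# Every image is paired with every kernel, as in the 8 x 4 = 32 test images.
dataset = {f"{name}+{kname}": convolve2d(img, k)
           for name, img in sharp_images.items()
           for kname, k in kernels.items()}
```

Convolving the single bright pixel of the first toy image reproduces the kernel 
around that pixel, which is a quick sanity check that the blur model is applied 
as intended. 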


Table 2.1 Average Evaluation Results on Large-Size Blurry Image Dataset 

Avg. Time 3515.97 
Avg. Rank 2.375 
ER 2.09; 
Kernel 1 | KS 1.04, 
3395.80, 
1.39, 
Kernel 2 1.83, 
3445.94, 
1.36; 
Kernel 3 1.13, 
3534.63, 
1.30, 
Kernel 4 2.04, 
3687.53, 

Notes: Methods marked with ' use our SalientPatch strategy. 

Figure 2.7. The Comparison of The Kernel Similarity and The Computing Time 


2. Quantitative and Quality test. 

To compare the performance of the algorithms, we evaluated three widely accepted, 
outstanding deblurring algorithms (PanDeblurring [PSPY16], KrishnanDeblurring 
[KTF11], and KoteraDeblurring [KSM13]) on three attributes: ER (error ratio), KS 
(kernel similarity) and computing time (in seconds). Levin et al. [LWDF09] first 
defined ER as $\|I_r - I_g\|^2 / \|I_{k_g} - I_g\|^2$, where $I_r$ is the 
recovered image, $I_g$ is the ground-truth clear image, and $I_{k_g}$ is the 
image deblurred with the ground-truth kernel $k_g$. The ER metric is often used to 
describe the quality of image restoration, and lower values are better. 
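
The error ratio can be computed directly from its definition. The sketch below 
assumes grayscale images stored as 2-D lists of equal size; the function names 
and the toy pixel values are illustrative only. 

```python
def ssd(a, b):
    """Sum of squared differences between two images of the same size."""
    return sum((p - q) ** 2 for row_a, row_b in zip(a, b)
               for p, q in zip(row_a, row_b))


def error_ratio(recovered, deblurred_with_gt_kernel, ground_truth):
    """ER: error with the estimated kernel over error with the true kernel."""
    return ssd(recovered, ground_truth) / ssd(deblurred_with_gt_kernel, ground_truth)


ground_truth = [[1, 2], [3, 4]]
with_gt_kernel = [[1, 2], [3, 5]]  # deblurred with the ground-truth kernel
recovered = [[1, 2], [4, 5]]       # deblurred with an estimated kernel
er = error_ratio(recovered, with_gt_kernel, ground_truth)
```

Here the recovered image has twice the squared error of the reference result, so 
ER is 2. A value of 1 would mean that the estimated kernel performs as well as 
the ground-truth kernel. 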

Hu and Yang [HY15] first defined the KS metric as the maximum response of the 
normalized cross-correlation. It measures the degree of similarity between the 
estimated blur kernel and the ground-truth kernel and is used specifically to 
evaluate the quality of blur kernel estimates. A higher KS value indicates a more 
accurate estimate. For a fair comparison, we used the same non-blind deblurring 
algorithm [PHSY14] to recover the blurred images with the four different 
estimated kernels. The experimental results in Table 2.1 and Figure 2.7 show that 
all the existing methods significantly improve their estimation accuracy when 
based on SalientPatch. In addition, our SalientPatch deblurring strategy reduces 
the amount of input data. As shown in Table 2.1, the kernel estimation of our 
deblurring algorithm is 9 times faster than that of the common algorithm. In the 
evaluation, we set all blur kernels to the same size, so the computing time is 
independent of the size of the ground-truth kernel but proportional to the image 
size. Because our SalientPatch contains a wealth of foreground information, it 
also improves accuracy in the same amount of time. The per-image ER, KS and 
computing time (in seconds) on the large-size blurred image dataset, listed in 
Tables 2.2, 2.3, and 2.4, demonstrate the advantages of our algorithm. More 
visual results on our large-size blurred image dataset are shown in Figure 2.8. 
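
The kernel similarity metric described above, the maximum normalized 
cross-correlation over all shifts, can be sketched as follows. The brute-force 
search over integer shifts and the function name are assumptions made for 
clarity rather than speed, and both kernels are assumed to be non-zero. 

```python
from math import sqrt


def kernel_similarity(k_est, k_true):
    """Maximum normalized cross-correlation of two kernels over integer shifts."""
    h, w = len(k_true), len(k_true[0])
    eh, ew = len(k_est), len(k_est[0])
    norm = (sqrt(sum(v * v for row in k_est for v in row))
            * sqrt(sum(v * v for row in k_true for v in row)))
    best = 0.0
    for dy in range(-(eh - 1), h):
        for dx in range(-(ew - 1), w):
            acc = 0.0
            for y in range(eh):
                for x in range(ew):
                    ty, tx = y + dy, x + dx
                    if 0 <= ty < h and 0 <= tx < w:
                        acc += k_est[y][x] * k_true[ty][tx]
            best = max(best, acc / norm)
    return best


k_true = [[0.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 0.0]]
k_shifted = [[0.0, 0.0, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 0.0]]   # same blur, shifted
k_vertical = [[0.0, 0.5, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.0]]  # different direction

ks_shifted = kernel_similarity(k_shifted, k_true)
ks_vertical = kernel_similarity(k_vertical, k_true)
```

Because the maximum is taken over shifts, a correctly shaped kernel that is 
merely translated still scores close to 1, whereas a kernel with the wrong 
motion direction scores clearly lower. 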


Figure 2.8 Comparison of SalientPatch Effectiveness 
Notes: More visual results on our dataset show the effectiveness of SalientPatch for blur 
kernel estimation. The left pictures in (a)~(d) are blurry images obtained by convolving the same clear 
image with four different blur kernels, and the yellow rectangle in each picture denotes 
the location of our selected SalientPatch. The right column shows the corresponding restored images. 
The kernels in the upper-left corner of the left and right pictures are the ground-truth kernel and the 
kernel estimated from our SalientPatch, respectively. The PSNR values of the four restored images from 
top to bottom are 28.13, 27.87, 29.34, and 26.97, respectively. 


3. Comparison of deblurring effects. 

To better illustrate the advantages of our algorithm, we further compare it with 
existing advanced blind deblurring methods based on region selection ([FSH06] and 
[HY15]). [FSH06] searches for the region with the highest variance and lowest 
saturation for kernel estimation. [HY15] selects a suitable region by scoring all 
sliding windows with a pre-trained CRF model. Based on each selected patch, we use 
the [KTF11] algorithm to estimate the blur kernel. 


Figure 2.9 Kernel Estimation from Different Patches 

Notes: 
(a) Blurry input and true kernel; (b)~(d) show the selected patch (yellow rectangle) and 
the corresponding estimated kernel (upper-left corner) for [HY15] (available at 
https://eng.ucmerced.edu/people/zhu/GoodRegion.html), [FSH06] (available at 
https://cs.nyu.edu/~fergus/research/deblur.html) and our SalientPatch, respectively. Our 
SalientPatch method obtains the most accurate kernel of the three methods. 


Figure 2.9 displays the blur kernel estimates based on the different selected 
patches and the corresponding deblurring results. The same blurred image yields 
different deblurring results when the blur kernel is estimated from different 
patches, and the kernel estimated from the SalientPatch recovers the image 
better than the other methods. For a fair comparison, the size of all selected 
patches is fixed to 240x240, as in [HY15]. Figure 2.10 further demonstrates that 
our SalientPatch region is better for blur kernel estimation than other regions: 
sub-figure (a) shows the blurred image and the ground-truth blur kernel, (b) 
shows the deblurring result and the blur kernel estimated from a randomly 
selected patch, and (c) shows the deblurring result of our method. Our proposed 
SalientPatch-based deblurring method achieves higher-quality image recovery than 
current state-of-the-art full-image methods, especially for the image foreground. 


Figure 2.10 Deblurring Results by Using Different Kernels 

In Figure 2.11, we compare the recovery results of foreground objects based on 
the ground-truth blur kernel, the patch-based blur kernel and the full-image blur 
kernel. The kernel estimation is based on [PSPY16], and the image recovery is 
based on the non-blind deblurring algorithm [PHSY14]. 


Figure 2.11 Comparison on Patch- and Full Image-based Deblurring 

Notes: 
(a) Blurry Input and Patches; (b) Different Blur Kernels; 
(c) Recovered with The Bad Patch; (d) Recovered with The True Kernel; 
(e) Recovered with Our SalientPatch; (f) Recovered with The Full Image. 


Figure 2.11 shows that the SalientPatch-based approach restores the details of 
the bird's feathers better (Figure 2.11 (e)) and is closer to the result based on 
the ground-truth kernel (Figure 2.11 (d)). This example supports our view that the 
smooth background area in the complete image does affect the accuracy of the blur 
kernel estimation. With the continuous development of deep learning, more and more 
deep learning-based image deblurring techniques have been proposed, such as 
[XRLJ14], [HKZv15] and [Cha16], and many of them achieve higher-precision image 
recovery. However, our SalientPatch method remains strongly competitive with deep 
learning methods. 

First, deep learning requires a large amount of appropriate training data. 
This means building a huge training set of blurred images, which is a very 
time-consuming and labor-intensive task. In contrast, our framework takes only a 
single image as input and does not require a training dataset, which makes our 
method efficient and easy to use. 

Second, deep learning methods limit the size of the input image and generally 
require cropping or down-sampling of large images. These operations can seriously 
affect the accuracy of image recovery. In contrast, our method can deblur images 
of any size, and its advantages are most evident for large images. 

Third, our method is universally applicable and can be used as a preprocessing 
step for any deblurring algorithm. In this way, our method can also be applied to 
deep learning networks and can even improve the deblurring quality of existing 
deep learning methods. In Figure 2.12, we further compare the [PSPY16] method 
based on our proposed SalientPatch with the outstanding deep learning algorithms 
[XRLJ14] and [HKZv15]. All input images are fixed to 2226x1474. Figure 2.13 shows 
the deblurring results of [XRLJ14], [Cha16] and our method on small images (image 
size 600x600). Our algorithm produces good results for both small and large 
images, and the 


accuracy is even better than that of the deep learning methods. 

Figure 2.12 Large Image Deblurring between Deep Learning and Our Method 

Notes: The PSNR for (b), (c), and (d) are 21.3193, 19.7259, and 30.736 respectively. 

Figure 2.13. Small Image Deblurring between Deep Learning and Our Method 

Table 2.2 The Evaluation Results about Error Ratio (ER) 

Average 
riding+k13 
riding+k17 
riding+k23 
riding+k25 
coffee+k13 
coffee+k17 
coffee+k23 
coffee+k25 
sign+k13 
sign+k17 
sign+k23 
sign+k25 
beach+k13 
beach+k17 
beach+k23 
beach+k25 
bike+k13 
bike+k17 
bike+k23 
bike+k25 
story+k13 
story+k17 
story+k23 
story+k25 
table+k13 
table+k17 
table+k23 
table+k25 
wall+k13 
wall+k17 
wall+k23 
wall+k25 

Notes: ER is based on the large-size blurry image dataset. The 6 columns from left to right are obtained from 6 
algorithms: PanDeblurring', PanDeblurring [PSPY16], KrishnanDeblurring', KrishnanDeblurring [KTF11], 
KoteraDeblurring', and KoteraDeblurring [KSM13]. Methods marked with ' are SalientPatch based and the other three are 
full image based. Each row denotes a blurry image (clear image convolved with kernel) in our dataset. The 


average of each method's results is shown in the second row, and relatively better results are in bold. 

Table 2.3 The Evaluation Results about Kernel Similarity (KS) 

Average 
riding+k13 
riding+k17 
riding+k23 
riding+k25 
coffee+k13 
coffee+k17 
coffee+k23 
coffee+k25 
sign+k13 
sign+k17 
sign+k23 
sign+k25 
beach+k13 
beach+k17 
beach+k23 
beach+k25 
bike+k13 
bike+k17 
bike+k23 
bike+k25 
story+k13 
story+k17 
story+k23 
story+k25 
table+k13 
table+k17 
table+k23 
table+k25 
wall+k13 
wall+k17 
wall+k23 
wall+k25 

Notes: KS is based on the large-size blurry image dataset. The 6 columns from left to right are obtained from 6 
algorithms: PanDeblurring', PanDeblurring [PSPY16], KrishnanDeblurring', KrishnanDeblurring [KTF11], 
KoteraDeblurring', and KoteraDeblurring [KSM13]. Methods marked with ' are SalientPatch based and the other three are 
full image based. Each row denotes a blurry image (clear image convolved with kernel) in our dataset. The 
average of each method's results is shown in the second row, and relatively better results are in bold. 




Table 2.4 The Evaluation Results about Computing Time (sec) 


Average 3515.97 
riding+k13 1373.78 
riding+k17 1475.33 
riding+k23 1405.83 
riding+k25 1471.47 
coffee+k13 1567.52 
coffee+k17 1564.90 
coffee+k23 1577.00 
coffee+k25 1576.07 
signt+k13 1974.42 
signt+k17 1713.14 
signt+k23 1755.63 
signt+k25 1700.47 
beach+k13 4544.30 
beach+k17 5075.61 
beach+k23 5192.69 
beach+k25 5030.81 
bike+k13 3840.77 
bike+k17 3885.77 
bike+k23 5259.17 
bike+k25 4746.08 
storyt+k13 7163.07 
storyt+k17 6642.36 
storyt+k23 6557.46 
storyt+k25 6642.36 
table+k13 2527.07 
table+k17 2583.65 
table+k23 2462.12 
table+k25 3150.77 
wali+k13 4175.44 
wali+k17 4626.75 
wali+k23 4067.18 
wali+k25 5182.19 


Notes: The six columns from left to right are obtained from six algorithms: PanDeblurring', PanDeblurring 
[PSPY16], KrishnanDeblurring', KrishnanDeblurring [KTF11], KoteraDeblurring', and KoteraDeblurring [KSM13]. 
Methods marked with a prime (') are SalientPatch-based and the other three are full-image-based. Each row denotes 
a blurry image (a clear image convolved with a kernel) in our dataset. The average of each method's results is 
shown in the second row, and relatively better results are in bold. 
4. Limitation. 

Our SalientPatch selection strategy relies on correctly excluding image 
backgrounds and smooth shadows. Inevitably, however, in some complex images the 
foreground and the background are inseparable. In this case, SalientPatch selection 
may become ambiguous because every sub-window produces a similar kernel estimate, 
and the image may not be recovered properly. Therefore, for complex images without 
an obvious foreground object, our method can only choose a rough square area to 
estimate the blur kernel. Even so, the SalientPatch strategy still works, as shown in 
Figure 2.14. 


(a) (b) (c) 


Figure 2.14 Image Deblurring in Complex Scene 


2.5 Conclusion 


Extensive experiments show that not every pixel or sub-window in an image 
contributes to accurate image deblurring. To find the most suitable region, we 
proposed a new concept, the "SalientPatch", which automatically locates a fine square 
region for better kernel estimation. First, the input image is converted into 
superpixels by multi-level segmentation. Then, multi-layer interest images are 
generated from three related cues (object probability, structural richness and local 
contrast). Finally, the SalientPatch is located on the binarized interest image. The 
experimental results demonstrate that our SalientPatch-based blind deblurring 
algorithm outperforms current leading blind deblurring methods such as [KTF11, 
KSM13, PSPY16] in both speed and deblurring accuracy, especially for large images. 
At the same time, it effectively improves the quality of the input images for stereo 
matching, which in turn supports more precise depth estimation in our proposed 
multi-view based 3D reconstruction. 

In the future, we intend to design a non-rectangular SalientPatch that adapts to 
objects of different shapes. At the same time, the SalientPatch will be further 
optimized to include more a priori information. We also hope to design parallel blind 
deblurring algorithms that run in real time and can be adapted to practical 
applications. 
Pedestrian Detection as A Priori 


3.1 Introduction 


With the continuous advancement of computer vision and artificial intelligence, 
pedestrian detection has become one of the most active topics in the field. Over the 
past decade, researchers have worked on pedestrian detection algorithms that can be 
used in practical applications. Unfortunately, no existing algorithm satisfies both 
high-precision detection and fast computation. In this paper, our research goal is to 
resolve this bottleneck and propose a strategy that effectively balances the accuracy 
and efficiency of detection. We designed an architecture similar to the 
state-of-the-art multi-class detector SSD and proposed a real-time pedestrian 
detection algorithm based on a multi-layer convolutional neural network. To improve 
the recall of pedestrians at different scales, we use the large feature maps generated 
in the lower layers of the fully convolutional network to detect small-scale 
pedestrians, while the small feature maps in the higher layers are responsible for 
large-scale pedestrians. Based on a statistical study of the general pedestrian shape, 
we proposed a new default box for the training process, which significantly reduces 
the miss rate. At the same time, the classifier in the loss function is simplified to 
better suit the pedestrian detection task and to accelerate computation. Evaluated on 
the Caltech Benchmark, currently the most authoritative pedestrian detection test 
set, our algorithm achieves an average miss rate of 11.88%. More importantly, it 
detects and locates pedestrians in real time (20 fps). This detection accuracy and 
running speed can satisfy the requirements of future intelligent applications. 
3.2 Related Work 


As an important topic in artificial intelligence and computer vision, pedestrian 
detection has always been a core technology for most navigation and public safety 
applications, and is widely used in practice, for example in assisted driving, 
intelligent monitoring and automatic driving. As the demand for these smart products 
continues to grow, the requirements on the accuracy and efficiency of pedestrian 
detection algorithms become stricter. Existing approaches to improving performance 
can be summarized into two categories: 

(1) Designing more advanced hand-crafted features or combining multiple pedestrian 
features, such as Checkerboards [ZBS15]; 

(2) Applying deep learning methods that combine hand-crafted features with 
convolutional features, such as CompACT-Deep [CSV15], RPN + BF [ZLLH16], and 
DeepParts [TLWT15]. 

However, most current pedestrian detection algorithms are unable to achieve a 
good balance between computational speed and detection accuracy. 

Recent state-of-the-art algorithms introduce deep learning models to detect 
pedestrians accurately. They learn the characteristics of pedestrians through 
convolutional neural networks (CNNs) for identification and localization, and greatly 
improve detection accuracy over traditional algorithms based on hand-crafted 
features. Many deep learning-based methods have been proposed. For example, Zeng et 
al. [ZOW13] combined traditional hand-crafted features (HOG + CSS + SVM) with a CNN, 
using the CNN to classify pedestrian candidate windows and thereby improve detection 
accuracy. However, their method still relies on traditional features, which limits the 
gain in detection performance. Ouyang and Wang [OW13] integrated four components 
(CNN, part detection, deformation model and visibility reasoning) into one deep 
learning framework. By establishing automatic interaction between the components, 
the method reduces the miss rate by a further 6% (the miss rate of [ZOW13] is 45%). 
Even so, these early deep learning methods do not improve the performance of 
pedestrian detection substantially. To further improve detection accuracy, the 
DeepParts model [TLWT15] uses a CNN and part models to handle pedestrian occlusion 
in real scenes. Unfortunately, this strategy does not take the time complexity of the 
algorithm into account. To address low efficiency, Girshick et al. first proposed 
R-CNN [GDDM14], which uses region selection to reduce the search range and 
significantly improves the speed of the CNN. It lays a good foundation for subsequent 
efficient detection algorithms. Later, Fast R-CNN was proposed to further speed up 
computation by using regions of interest and a multi-task loss function in a single 
training stage. However, experiments have shown that Fast R-CNN is not well suited to 
pedestrian detection. Ren et al. [GRGS15] upgraded the Fast R-CNN framework and 
designed a Region Proposal Network (RPN) to assist in finding the right search areas. 
Their approach significantly improves the speed of pedestrian detection, but still 
does not improve its robustness. Clearly, the limitations of existing strategies have 
held back the development of many applications based on pedestrian detection. 

Most of the current algorithms train the network on a single feature map. This 
strategy is clearly not suitable for pedestrians, whose scale changes dynamically. The 
most widely recognized recent object detection algorithm is the Single Shot MultiBox 
Detector (SSD) [LAE16], which uses multi-scale feature maps from different 
convolutional layers to adapt to objects of different scales and categories. The 
algorithm provides object detection that is both fast and of high quality. Inspired 
by detection on feature maps from multiple convolutional layers, Cai et al. [CFFV16] 
proposed the MS-CNN network, which further improves the detection accuracy of 
small-scale objects by utilizing intermediate network layers. However, multi-category 
detection strategies are not suitable for pedestrian detection. Since the size of 
pedestrians changes dynamically in real scenes, the classifier easily misjudges the 
category, which increases the false detection rate for pedestrians. In summary, for 
the special case of pedestrians, neither a method based on a single feature map nor a 
method based on multi-scale feature maps from different layers provides stable 
detection. 


3.3 Differences and Contributions 


The above analysis shows that pedestrian detection with a single feature map 
cannot adapt to changes in scale, while multi-class detectors based on multi-scale 
feature maps cannot provide sufficiently robust results because of category 
misjudgment. To overcome this predicament, our research draws mainly on two works: 
the statistics of pedestrian shape and scale [DWSP12] and the SSD algorithm 
(multi-scale and multi-category object detection). In our proposed pedestrian 
detection algorithm, the feature maps generated by different convolutional layers are 
fully utilized to perform end-to-end pedestrian detection in a fully convolutional 
network. To improve both detection accuracy and computation speed, we first designed 
multi-scale feature pyramid default boxes for network training based on pedestrian 
characteristics. Second, we modified and optimized the multi-class detection loss 
function of the original SSD to make it more suitable for pedestrian detection. 
Finally, extensive experiments show that our algorithm provides real-time pedestrian 
detection in real scenes, with detection accuracy competitive with existing advanced 
detection algorithms. The contributions are as follows: 

(1) We inherit the ability of the SSD algorithm to accurately identify objects at 
different scales, and transform a multi-category object detector into an effective 
pedestrian detector through optimization. Our detector enables real-time pedestrian 
detection in images or video, and its detection accuracy exceeds that of the original 
SSD algorithm and other related pedestrian detection algorithms. 

(2) We use the convolutional layers to generate feature maps of different sizes 
that form a multi-scale feature pyramid. In this pyramid, low-level feature maps are 
used to detect smaller-scale pedestrians in the scene, and high-level feature maps 
are responsible for large-scale pedestrians. This feature map scheduling strategy 
significantly improves detection accuracy for pedestrians at varying scales. 

(3) To reduce the miss rate during network learning, we created a new default 
box that better matches the structural properties of pedestrians. Our design improves 
both the speed and the accuracy of pedestrian detection. 

(4) We simplified and accelerated the SSD classifier so that it only distinguishes 
between the background and pedestrians. The improved loss function is better suited 
to pedestrian detection, with lower structural complexity and a lower false detection 
rate. 


3.4 Proposed Method 


3.4.1 Overview 

Unlike other types of object detection, the purpose of pedestrian detection is 
very simple: for an image, it only needs to distinguish accurately between the 
background and pedestrian categories. Each pedestrian has unique attributes, such as 
gait, clothing, standing position, body type and skin color, but most pedestrians 
look similar enough to be treated as a single class. In fact, for pedestrians in real 
scenes, small pedestrian sizes, varying scales, and occlusion are the most difficult 
problems for detection, as shown in Figure 3.1. Here, we focus on ways to solve these 
problems. 

Our aim is to account simultaneously for the ability to identify multi-scale 
pedestrians, the ability to deal with occlusion, and the complexity of the algorithm. 
To this end, we establish a new pedestrian detection algorithm based on multi-level 
convolutional features. The image segmentation algorithm FCN shows that lower-layer 
feature maps can improve segmentation because they retain more image detail. 
Therefore, we use low-level feature maps to deal with small targets and high-level 
feature maps for large targets, as shown in Figure 3.2. This strategy effectively 
preserves detection accuracy for small pedestrians. 


Figure 3.1 Different Scales of Pedestrians in Typical Street Scene 
Notes: The images are sampled from typical street scenes in the Caltech Pedestrian dataset. 
Clearly, the scales of pedestrians differ widely within the same scene. Furthermore, the 
pedestrians are all moving, which easily leads to occlusion. 


Figure 3.2 Convolutional Network for Pedestrian Detection 


We increased the number of convolutional layers in the VGG16 network 
architecture [CPK14] in order to generate feature maps at more scales for pedestrian 
detection at different scales. As shown in Figure 3.3, we construct feature pyramids 
with different resolutions across the layers of the network to calculate the 
confidence and position of the prediction boxes. Generally, the feature maps in the 
early layers of a CNN are relatively large, and subsequent layers use pooling to 
gradually reduce the feature map size. We use both larger and smaller feature maps 
for detection. The advantage is that larger feature maps can be divided into more 
cells, so small-scale a priori boxes can be used to detect relatively small targets. 
In contrast, a small feature map has larger a priori boxes and is therefore 
responsible for detecting large targets. However, because the features of the 
high-level network are more abstract, they affect the speed of prediction. After 
extensive experiments, we finally decided to ignore all high-level features and only 
select specific low-level features for pedestrian detection. In this way, our 
detection strategy makes full use of features at different levels while reducing the 
complexity of the algorithm compared with detection based on a single feature map. 
Figure 3.3 Feature Pyramids in Our Network 


Notes: Seven feature maps with different resolutions are used for pedestrian detection. 


3.4.2 Prediction Boxes in Different Layers 

Existing object detection algorithms typically use rectangular prediction boxes of 
several shapes to detect objects of different categories. For example, the SSD 
algorithm, a representative multi-category object detector, assigns multiple 
prediction boxes to each feature map cell with five aspect ratios (1:1, 1:2, 2:1, 1:3, 
and 3:1). During training, multiple candidate boxes are fitted and regressed against 
the groundtruth to continuously optimize the size and position of the window and 
finally find the positive samples. This method can effectively distinguish multiple 
kinds of objects in real scenes. However, the shape of pedestrians is special and 
their scale changes frequently. Using the SSD method directly to detect pedestrians 
only results in a higher rate of false positives. Through a large number of 
clustering experiments, YOLO v2 [RF17] found that the more similar the shape of the 
prediction box is to the object, the higher the probability of correct detection. In 
summary, the existing aspect ratios do not fit the shape of pedestrians well, which is 
one of the factors that seriously affect the performance of pedestrian detection. 

Dollar et al. [DWSP12] first conducted an in-depth study of the structure of 
pedestrians. Using statistical models, they observed that the shape distribution of 
pedestrians can be expressed in logarithmic form. After extensive experimentation 
with pedestrian samples, they determined that a log-average aspect ratio of 0.41 
(width to height) best fits pedestrians at any scale. Since our algorithm focuses on 
pedestrian detection, prediction boxes with the single aspect ratio that best fits 
the shape of a pedestrian are sufficient for training. This also greatly accelerates 
detection. Therefore, we abandoned the multiple-aspect-ratio prediction boxes of SSD 
and used prediction boxes constrained by the single log-average aspect ratio. 

As in existing multi-category object detection methods, we compare the similarity 
of each new prediction box with the groundtruth box to find the best positive 
samples. Experiments show (see Table 3.1) that the single-aspect-ratio prediction box 
strategy not only significantly reduces the complexity of prediction, but also brings 
a modest improvement in accuracy and efficiency over the original SSD algorithm. Our 
algorithm is therefore better suited to practical applications. 
Table 3.1 Performance Comparison Between Original SSD and Our Method 


SSD512 1:1,1:2,2:1,1:3,3:1 4,6,6,6,6,4,4 512x512 
ours 0.41 3,1,1,1,1,1,1 512x512 


Notes: Compared with the original SSD, the AP (average precision) of detection is slightly improved after using 
our proposed boxes with a single aspect ratio. Meanwhile, the detection speed increases by about 2 fps. 
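
As a rough illustration of the reduced prediction complexity, the total number of 
prediction boxes per image can be counted from the figures in Table 3.1. The sketch 
below assumes that the third column of Table 3.1 lists the number of boxes per feature 
map cell in each of the seven layers, and that the feature map sizes are the 
feat_shapes given in Algorithm 1. The function name total_boxes is hypothetical. 

```python
def total_boxes(feat_shapes, boxes_per_cell):
    """Sum of (rows * cols * boxes per cell) over all detection layers."""
    return sum(h * w * n for (h, w), n in zip(feat_shapes, boxes_per_cell))


# Feature map sizes for a 512 x 512 input (feat_shapes in Algorithm 1).
feat_shapes = [(64, 64), (32, 32), (16, 16), (8, 8), (4, 4), (2, 2), (1, 1)]

# Boxes per cell, read from Table 3.1 (assumed meaning of the third column).
ssd512 = total_boxes(feat_shapes, [4, 6, 6, 6, 6, 4, 4])
ours = total_boxes(feat_shapes, [3, 1, 1, 1, 1, 1, 1])

print(ssd512, ours)  # 24564 13653
```

Under these assumptions, the single-aspect-ratio design evaluates a little over half 
as many boxes as SSD512 for the same input size. 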


Drawing on the idea of anchors in Faster R-CNN and taking the characteristics of 
pedestrians into account, we assign multiple a priori boxes with different scales and 
the same aspect ratio to the feature map cells of different layers. The predicted 
bounding boxes are computed relative to these priors by applying small convolutional 
filters (3 x 3) over several feature maps. Such a strategy reduces the training 
difficulty, promotes the regression of confidence and position, and improves the 
accuracy of pedestrian detection. 

To make our algorithm more convincing, we set our prediction box parameters 
according to the Caltech Pedestrian Benchmark standard. In its test set, 15% of the 
pedestrians have a scale of 0 to 30 pixels, 69% have a scale of 30 to 80 pixels, and 
16% have a scale of 80 pixels or more. In our pedestrian detection network, seven 
feature maps are used to detect pedestrians (as shown in Table 3.2), where the scales 
of the prediction boxes at the Conv4_3 layer are specified as 0.04, 0.06 and 0.08 of 
the input image. From the Conv7 layer to the Conv12 layer, we use the following 
formula to define the scales of the prediction boxes uniformly: 


$$ s_k = s_{\min} + \frac{s_{\max} - s_{\min}}{m - 1}(k - 1), \quad k \in [1, m] \tag{3.1} $$


where $s_{\min}$ and $s_{\max}$ represent the minimum and maximum scales of the 
prediction boxes, and the minimum scale is small enough for small-scale pedestrians 
in real scenes. 
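
The scale rule in Eq. (3.1) can be sketched in a few lines. The values s_min = 0.1 and 
s_max = 0.9 are assumptions taken from min_ratio = 10 and max_ratio = 90 in Algorithm 
1, and m = 6 is assumed to cover the six layers from Conv7 to Conv12. The function 
name box_scales is hypothetical. 

```python
def box_scales(s_min=0.1, s_max=0.9, m=6):
    """Evenly spaced scales s_k for k = 1..m, following Eq. (3.1)."""
    if m < 2:
        raise ValueError("m must be at least 2")
    step = (s_max - s_min) / (m - 1)
    return [s_min + step * (k - 1) for k in range(1, m + 1)]


scales = box_scales()
# Rounded for display: 0.1, 0.26, 0.42, 0.58, 0.74, 0.9
print([round(s, 2) for s in scales])
```

Multiplying each scale by the input size (512 pixels here) gives the box scale in 
pixels for the corresponding layer. 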
Table 3.2 The Scales of Prediction Boxes in Each Layer 


Scale 
Actual scale 
Actual height 
Actual width 


Once the box scale is determined, the size of the prediction box can be 
calculated immediately with the following formulas: 


$$ w_k = s_k \sqrt{a_r}, \qquad h_k = \frac{s_k}{\sqrt{a_r}} \tag{3.2} $$


where $a_r$ is the fixed aspect ratio of the boxes, and $w_k$ and $h_k$ are the width 
and height of the boxes, respectively. Finally, we locate each box by calculating its 
center: the center of each prediction box is 
$\left(\frac{i + 0.5}{|f_k|}, \frac{j + 0.5}{|f_k|}\right)$, where 
$i, j \in [0, |f_k|)$ and $|f_k|$ is the size of the k-th feature map. 
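
A minimal sketch of Eq. (3.2) and the center rule follows. It assumes the aspect ratio 
0.41 is width divided by height, consistent with Algorithm 1, and that all 
coordinates are normalized to the unit square. The function names are hypothetical. 

```python
import math


def box_size(scale, aspect_ratio=0.41):
    """Width and height of a prediction box for a given scale, as in Eq. (3.2)."""
    w = scale * math.sqrt(aspect_ratio)
    h = scale / math.sqrt(aspect_ratio)
    return w, h


def box_centers(fmap_size):
    """Normalized centers ((i + 0.5) / |f_k|, (j + 0.5) / |f_k|) of every cell."""
    return [((i + 0.5) / fmap_size, (j + 0.5) / fmap_size)
            for i in range(fmap_size)
            for j in range(fmap_size)]


w, h = box_size(0.26)
centers = box_centers(2)
print(round(w / h, 2))  # 0.41
print(centers)          # [(0.25, 0.25), (0.25, 0.75), (0.75, 0.25), (0.75, 0.75)]
```

Because w * h equals the squared scale, the box area stays fixed for a given layer 
while the shape follows the pedestrian aspect ratio. 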


3.4.3 Match and Training Objective 

In the training phase, we compare the generated prediction boxes with the 
groundtruth. If the similarity between a box and a groundtruth box reaches a certain 
threshold, we consider the box a positive sample; otherwise, it is a negative sample. 
In our algorithm (as shown in Figure 3.4), a groundtruth box can match multiple 
prediction boxes, so the network does not have to pick only the single box with 
maximum overlap, which also improves computational efficiency. 

Since pedestrians occupy only a small proportion of the image relative to the 
background, there are only a few groundtruth boxes and a large number of prediction 
boxes. Matching each groundtruth in the image only to the prediction box with maximum 
overlap guarantees one positive box per groundtruth, but leaves most of the remaining 
boxes as negative samples. The negative samples then far outnumber the positive 
samples, which is not conducive to the training of pedestrian detection. Therefore, 
for the remaining unmatched boxes, if the similarity to a groundtruth is greater than 
a certain threshold (usually 0.5), the a priori box is also considered to match that 
groundtruth. Although a groundtruth can now match multiple prediction boxes, the 
number of positive boxes is still small compared with the number of a priori boxes. 
To keep the positive and negative samples as balanced as possible, we use hard 
negative mining to re-screen the negative samples: we select only the top-k negatives 
as training samples, so that the ratio of positive to negative samples is close to 
1:3. 
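
The matching and hard negative mining steps described above can be sketched as 
follows. This is an illustration under stated assumptions, not the implementation 
used in our experiments: boxes are given as (x_min, y_min, x_max, y_max), negatives 
are ranked by a per-box loss value supplied by the caller (as SSD does with the 
confidence loss), and all function names are hypothetical. 

```python
def jaccard(a, b):
    """Jaccard overlap (IoU) of two boxes (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_boxes(priors, truths, threshold=0.5):
    """For each prior, return the index of its matched groundtruth, or -1."""
    labels = [-1] * len(priors)
    if not priors or not truths:
        return labels
    # Step 1: every groundtruth claims the prior with maximum overlap.
    for j, t in enumerate(truths):
        best = max(range(len(priors)), key=lambda i: jaccard(priors[i], t))
        labels[best] = j
    # Step 2: remaining priors match a groundtruth if the overlap exceeds the threshold.
    for i, p in enumerate(priors):
        if labels[i] == -1:
            overlaps = [jaccard(p, t) for t in truths]
            j = max(range(len(truths)), key=overlaps.__getitem__)
            if overlaps[j] > threshold:
                labels[i] = j
    return labels


def hard_negatives(labels, losses, neg_pos_ratio=3):
    """Keep the highest-loss negatives, at most neg_pos_ratio per positive."""
    num_pos = sum(1 for label in labels if label >= 0)
    negatives = [i for i, label in enumerate(labels) if label < 0]
    negatives.sort(key=lambda i: losses[i], reverse=True)
    return negatives[:neg_pos_ratio * num_pos]


priors = [(0, 0, 10, 20), (1, 0, 11, 20), (50, 50, 60, 70),
          (80, 80, 90, 100), (30, 30, 40, 50), (0, 50, 10, 70)]
truths = [(0, 0, 10, 20)]
labels = match_boxes(priors, truths)
kept = hard_negatives(labels, [0.0, 0.0, 0.9, 0.1, 0.5, 0.7], neg_pos_ratio=1)
print(labels)  # [0, 0, -1, -1, -1, -1]
print(kept)    # [2, 5]
```

In the example, the second prior overlaps the groundtruth with a Jaccard value of 
about 0.82, so it becomes an additional positive in step 2. 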
Figure 3.4 Groundtruth Boxes Match with Multiple Prediction Boxes 
Notes: The prediction boxes come from multi-scale feature maps. The pedestrian with the red 
groundtruth box is matched to the red prediction box from the 4 x 4 feature map, while the pedestrian 
with the blue groundtruth box is matched to the prediction boxes from the 8 x 8 feature map. 


Multi-category object detectors typically use classifiers and indicators to 
distinguish the categories of the predicted boxes. This, however, creates a hidden 
risk for the subsequent detection. Since multi-category object detection can provide 
high-quality detection only if the classification is correct, the detection rate 
achievable with the loss function is limited by the classifier. Moreover, the scale 
of pedestrians changes frequently, which makes pedestrians more prone to 
classification errors than other objects. Therefore, existing multi-category 
detection loss functions such as that of SSD are not suitable for pedestrian 
detection. To solve this problem, we simplified the SSD classifiers and indicators to 
fit the characteristics of pedestrians. Filters are applied at each prediction box to 
predict the confidence of the pedestrian and calculate the shape offsets. Similar to 
most object detection algorithms [ZOW13, OW13, LTWT14, JOBS15], we combine only a 
confidence loss (conf) and a localization loss (loc) in our loss function: 


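$$ F(x, c, l, g) = \begin{cases} \dfrac{1}{N}\left(F_{conf}(x, c) + F_{loc}(x, l, g)\right), & N \neq 0 \\ 0, & N = 0 \end{cases} \tag{3.3} $$


where N is the total number of positive samples. We define $F_{conf}(\cdot)$ as a 
softmax loss over the confidences c: 


$$ F_{conf}(x, c) = -\sum_{i \in Pos} x_{ij} \log(\hat{c}_i^{p}) - \sum_{i \in Neg} \log(\hat{c}_i^{0}), \qquad \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})} \tag{3.4} $$


where $x_{ij}$ is an indicator that describes whether the current prediction box i 
matches the groundtruth box j. Then, we define $F_{loc}(\cdot)$ as a smooth L1 loss 
between the prediction box l and the groundtruth box g: 


$$ F_{loc}(x, l, g) = \sum_{i \in Pos} \sum_{m \in \{cx, cy, w, h\}} x_{ij}\, \mathrm{smooth}_{L1}(l_i^{m} - g_j^{m}) \tag{3.5} $$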
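
The structure of the loss in Eqs. (3.3) to (3.5) can be illustrated with a small pure 
Python sketch. It assumes two classes ordered as [background, pedestrian], boxes 
encoded as (cx, cy, w, h), and the standard smooth L1 definition (0.5 x^2 for |x| < 1, 
|x| - 0.5 otherwise), which is not spelled out in the text. The function names and the 
input layout are hypothetical. 

```python
import math


def smooth_l1(x):
    """Standard smooth L1: quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def detection_loss(positives, negatives):
    """Combined loss: (confidence loss + localization loss) / N, or 0 if N == 0.

    positives: list of (logits, predicted_box, groundtruth_box)
    negatives: list of logits for the selected negative boxes
    """
    n = len(positives)
    if n == 0:
        return 0.0
    conf = 0.0
    loc = 0.0
    for logits, pred, truth in positives:
        conf -= math.log(softmax(logits)[1])   # pedestrian probability
        loc += sum(smooth_l1(p - g) for p, g in zip(pred, truth))
    for logits in negatives:
        conf -= math.log(softmax(logits)[0])   # background probability
    return (conf + loc) / n


example = detection_loss(
    positives=[([0.0, 0.0], (0.5, 0.5, 0.2, 0.4), (0.5, 0.5, 0.2, 0.4))],
    negatives=[],
)
print(round(example, 4))  # 0.6931
```

With a perfectly localized box and equal class scores, only the confidence term 
remains, which is why the example evaluates to log 2. 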


In summary, the training process of our convolutional neural network can be 
roughly described as follows. First, color pedestrian images with groundtruth boxes 
are input. Then, we generate multiple prediction boxes at the selected convolutional 
layers and match them against the groundtruth to obtain positive and negative 
samples. The loss is then calculated from these samples and back-propagated through 
the network (see Algorithm 1 for details). Note that all input images are first 
converted to 3 channels and resized to 512 x 512 before being fed into the network. 
We use Jaccard overlap to measure the similarity between a prediction box and a 
groundtruth box, with the overlap threshold set to 0.5. Then, we use hard negative 
mining to keep the ratio of negative to positive samples at 3:1. 
Algorithm 1 The training process of the proposed algorithm 


# Input data: 
resize_img(width = 512, height = 512) 
groundtruth_bboxes = get_groundtruth_bbox(xg, yg, hg, wg) 

# Define source layers: 
mbox_source_layers = ['conv4_3', 'conv7', 'conv8_2', 'conv9_2', 'conv10_2', 'conv11_2', 'conv12_2'] 
default_params = Params(mbox_source_layers) 
feat_shapes = [(64, 64), (32, 32), (16, 16), (8, 8), (4, 4), (2, 2), (1, 1)] 

# Define prediction boxes (min_ratio = 10, max_ratio = 90): 
aspect_ratio = 0.41 
step = int((max_ratio - min_ratio) / (len(mbox_source_layers) - 2)) 
for ratio in range(min_ratio, max_ratio + 1, step) do 
    min_sizes.append(img_scale * ratio / 100) 
end for 
sizes = [[img_scale * 4 / 100, img_scale * 6 / 100, img_scale * 8 / 100]] + min_sizes 

# Make prediction boxes: 
strides = [8, 16, 32, 64, 128, 256, 512] 
offset = 0.5 
for k in range(1, len(mbox_source_layers) + 1) do 
    hd_k = size_k / math.sqrt(aspect_ratio) 
    wd_k = size_k * math.sqrt(aspect_ratio) 
    yd_k, xd_k = np.mgrid[0 : feat_shapes[k-1][0], 0 : feat_shapes[k-1][1]] 
    yd_k = (yd_k.astype(dtype) + offset) * strides[k-1] / img_shape[0] 
    xd_k = (xd_k.astype(dtype) + offset) * strides[k-1] / img_shape[1] 
end for 
return yd, xd, hd, wd 

# Match and compute loss: 
FindMatches(groundtruth_bboxes(xg, yg, hg, wg), default_bboxes(yd, xd, hd, wd), jaccard_threshold = 0.5) 
return feat_labels, feat_localizations, feat_scores 
Hard_Examples_Mining(neg_pos_ratio = 3) 

# Forward and back propagate: 
loc_loss_layer_->Forward(loc_bottom_vec_, loc_top_vec_) 
conf_loss_layer_->Forward(conf_bottom_vec_, conf_top_vec_) 
loss = summary(loc_loss, conf_loss) 
loc_loss_layer_->Backward(loc_top_vec_, loc_propagate_down, loc_bottom_vec_) 


3.5 Experimental Results and Evaluations 


3.5.1 Performance on Caltech Benchmark 


To demonstrate the strength of our algorithm, we mainly use the Caltech 
Pedestrian Benchmark, currently the most authoritative benchmark, as the evaluation 
standard. It contains about 250,000 frames (about 137 minutes) of real pedestrian 
scene video, in which 350,000 groundtruth boxes and about 2,300 pedestrians are 
accurately annotated. In this paper, we built our own training set, which integrates 
carefully selected Caltech data with all image data from the ETH and TUD-Brussels 
datasets. Our dataset contains a total of 27,021 pedestrian video frames. 
Figure 3.5 Miss Rate Comparison with State-Of-The-Art Methods 


Notes: Current state-of-the-art methods on the Caltech set using an 
intersection-over-union (IoU) threshold of 0.5. 
We fine-tune the parameters of the network model with stochastic gradient 
descent (SGD), setting the initial learning rate to 0.001 (during training, the 
learning rate decays by a factor of 0.1 every 10,000 iterations) and setting the 
momentum, weight decay and maximum number of iterations to 0.9, 0.0005 and 40,000, 
respectively. We use False Positives Per Image (FPPI) to evaluate the performance of 
the pedestrian detection algorithm, which captures the deviation between the 
prediction results and the actual situation. FPPI indicates how often a negative 
sample is mistaken for a positive sample. The evaluation metric is the log-average 
miss rate (MR); the smaller the value, the better the performance of the algorithm. 
As shown in Figure 3.5, our algorithm is only slightly inferior to the top two 
algorithms, RPN + BF [ZLLH16] and CompACT-Deep [CSV15]. Moreover, our algorithm is 
not only precise but also very competitive when indicators such as efficiency and 
complexity are considered. On the most commonly used "reasonable" subset, we compare 
the recent top 10 pedestrian detection algorithms with our proposed algorithm, 
evaluating their ability to handle distance, size and occlusion problems. The 
advantages of our method are shown in Table 3.3, which also illustrates that 
convolutional features can achieve relatively high-accuracy pedestrian detection. 
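
For reference, the log-average miss rate can be computed from a miss rate versus FPPI 
curve as sketched below. The sketch follows the usual Caltech convention of averaging 
in log space over nine FPPI reference points evenly spaced between 10^-2 and 10^0. 
Sampling the curve at the largest FPPI not exceeding each reference point, and using 
a miss rate of 1.0 when the curve does not reach a reference point, are assumptions of 
this sketch. The function name is hypothetical. 

```python
import math


def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Geometric mean of miss rates sampled at log-spaced FPPI reference points.

    fppi must be sorted in ascending order and aligned with miss_rate.
    """
    refs = [10 ** (-2 + 2 * k / (num_points - 1)) for k in range(num_points)]
    log_sum = 0.0
    for ref in refs:
        # Miss rate at the largest FPPI that does not exceed the reference point.
        reached = [mr for f, mr in zip(fppi, miss_rate) if f <= ref]
        mr = reached[-1] if reached else 1.0
        log_sum += math.log(max(mr, 1e-10))
    return math.exp(log_sum / num_points)


# A flat curve must return its own miss rate.
flat = log_average_miss_rate([0.001, 1.0], [0.2, 0.2])
print(round(flat, 4))  # 0.2
```

Averaging in log space keeps a single very low miss rate at high FPPI from dominating 
the summary value. 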


DeepParts 
Checkerboards 


Checkerboards+ 


In Table 3.3, the pedestrian-scale comparison is subdivided into near-scale, 
medium-scale, large-scale and long-distance cases. The occlusion comparison is 
divided into non-occluded, partially occluded and heavily occluded cases. "All" 
indicates that the pedestrians in the image exhibit all of the characteristics listed 
above, which is extremely demanding for a pedestrian detection algorithm. 
Nevertheless, our algorithm ranks first in the "all" case, which indicates that it 
performs better than existing advanced algorithms in this setting. 

Existing advanced algorithms have largely solved the accuracy problem of 
pedestrian detection, and the accuracy gap between them is small. The biggest 
bottleneck at present is that most existing methods are very inefficient. In 
particular, no single algorithm achieves high efficiency and high precision at the 
same time. In terms of speed, our algorithm inherits the efficiency of SSD and 
achieves a further speed-up, which enables real-time pedestrian detection (20 fps). 
We compared it with several currently recognized high-efficiency pedestrian 
detection algorithms such as RPN + BF [ZLLH16] and CompACT-Deep [CSV15]. As shown in 
Table 3.4, our algorithm is 10 times faster than these two. The network we designed 
is also nearly three times faster than the popular Faster R-CNN, while maintaining 
detection accuracy. 


Table 3.4 Speed Comparison Between State-of-the-Art Methods and Ours

Method | Hardware (GPU) | fps | MR (%)
RPN+BF | K40 | | 9.58
CompACT-Deep | K40 | | 11.75
Faster R-CNN | K40 | | 20.2
Faster R-CNN (Optimized) | K40 | | 16.2
SSD | Titan X Pascal | | 20.3
Ours | Titan X Pascal | | 11.88


3.5.2 Performance on Scales and Occlusion Subsets 

The most important criterion for judging whether an algorithm is practical is its ability
to handle pedestrian scale and occlusion. In real scenes, pedestrians generally appear at
small scales and are accompanied by dense occlusion. Therefore, we mainly focus on how
the algorithm handles smaller scales and dense occlusion. RocScale is set to far and
medium as the pedestrian scale indicator. Most of the methods shown in Figure 3.6 do not
correctly detect pedestrians under this setting.

Notes: The methods on Caltech-USA scale subsets.


Figure 3.6 Occlusion Test Between Our Model and State-Of-The-Arts
(Plots of miss rate versus false positives per image; the legends list the log-average
miss rate of each method.)

(a) Setting the RocOcc as partial
99% VJ, 84% HOG, 42% Katamari, 41% SCCPriors, 41% CCF, 39% SpatialPooling+,
39% SpatialPooling, 38% CCF+CF, 33% TA-CNN, 31% Checkerboards+, 31% Checkerboards,
25% CompACT-Deep, 24% RPN+BF, 20% Ours, 20% DeepParts

(b) Setting the RocOcc as heavy
99% VJ, 96% HOG, 80% MT-DPM+Context, 79% SDN, 78% SpatialPooling+, 78% SpatialPooling,
78% Checkerboards+, 78% Checkerboards, 74% RPN+BF, 73% CCF+CF, 72% CCF, 70% TA-CNN,
66% CompACT-Deep, 60% DeepParts, 51% Ours

Notes: The methods on Caltech-USA occlusion subsets.


However, our algorithm handles small-scale pedestrians significantly better than all
other algorithms (39.7% MR at far scale and 77.3% MR at medium scale). Our result is
15.7% better than that of the closest competing algorithm. RocOcc is set to partial and
heavy as the indicator for measuring occlusion handling. Under partial occlusion, our
algorithm (20.1% MR) is only slightly behind the first-ranked DeepParts (19.9% MR), while
under heavy occlusion our results rank first. In summary, our algorithm handles
multi-scale and occlusion problems better than existing state-of-the-art algorithms.


Figure 3.7. Detection Results of CompACT-Deep and Ours on Scale Subset 


Numbers state the facts clearly and rigorously, but images explain the problem more
intuitively. We selected detection results of several recognized, high-performing methods
on the Caltech benchmark to compare with ours. Figure 3.7 shows the results of our
algorithm and CompACT-Deep on small-scale pedestrians, where the ground-truth box is
shown in red and the predicted box in green. It can be seen that our predictions are
closer to the ground truth.

To evaluate occlusion handling, we selected heavily occluded test images, on which
general algorithms struggle to obtain correct detections. We therefore compare our
algorithm with DeepParts, which currently has the best occlusion handling. As shown in
Figure 3.8, the results of our algorithm are competitive with DeepParts. However, our
algorithm has a high rate of false detections, especially in the second and third images,
where pedestrians overlap very densely. This result shows that our algorithm still
requires further improvement.


Figure 3.8 Detection Results of CompACT-Deep and Ours on Occlusion Subset 


3.5.3. Performance in Real-Life Scenes 

All previous evaluations are based on internationally recognized test sets. Although they
are persuasive, the test videos were purposefully and carefully screened, and such
curated cases do not fully represent reality. Therefore, we first set up several cameras
to shoot scenes from multiple angles. Then, a randomly chosen video of an indoor
basketball game was used for testing. Since indoor basketball usually involves severe
occlusion, multi-scale pedestrians, and poor lighting, this video is difficult for all
pedestrian detection algorithms. We also applied our algorithm to pedestrian detection in
real environments such as subways, parks, and campuses. Factors such as different
viewpoints, lighting conditions, and camera focal lengths have an additional impact on
pedestrian detection that is not captured by the existing test sets. Figure 3.9 and
Figure 3.10 show our detection results in various real-world scenarios. They indicate
that our algorithm can provide real-time, high-precision pedestrian detection in
practical applications.


Figure 3.9 Detection Results in Basketball Games 
Notes: In the indoor basketball game with poor lighting conditions, our pedestrian
detection algorithm can still accurately detect every athlete in such a disordered
environment at real-time speed.


Figure 3.10 Detection Results in Real-Life Environments 


Notes: Our proposed pedestrian detection algorithm is tested in different scenes at
different times. It is clear that our algorithm can be applied to different practical
applications.


3.6 Analysis of Our Model


3.6.1 Analysis of Multi-Layer Detection

In the above tests, our pedestrian detection algorithm proved highly capable of dealing
with multi-scale and occlusion problems, and its detection results are impressive on each
test set. For multi-scale problems, our proposed multi-convolution feature pyramid
utilizes pedestrian semantic information from various layers and different scales.
Although many algorithms have proposed strategies for applying convolutional features to
detect targets, most of them use only the features of the last layer of the network.
Since the feature map of the upper layers is too abstract, it is not suitable for
identifying small-scale objects. To further illustrate the benefits of using feature maps
from every layer, we recorded the MR value of each layer during training to demonstrate
the contribution of different layers to pedestrian detection accuracy (see Table 3.5).


Table 3.5 Contribution of Different Layers in Our Network

Layers compared: Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2, Conv11_2, Conv12_2
Reported quantity: MR rise compared with using all layers

Starting from the bottom of the network, we found that the layers before Conv4_3 did not
improve MR but instead reduced performance, so we discarded these low-level features. The
features learned in the Conv4_3 layer significantly reduce MR, indicating that the
feature map of this layer deserves particular emphasis. The layers from Conv4_3 to
Conv12_2 all help to improve MR to some degree. These experiments show that different
feature maps affect the final quality of pedestrian detection.


3.6.2 The Effect of Training Data

We found that the selection of training data is also critical to the performance of
pedestrian detection. As shown in Table 3.6, we first trained our network on the Caltech
Pedestrian dataset alone and achieved an average precision (AP) of 0.535 and an MR of
56%. We then extracted the key frames from the Caltech Pedestrian dataset and merged them
with the ETH and TUD-Brussels data into a new training set. With this set, AP increased
by 2.5% and MR decreased by 1.6%, which indicates that an appropriate choice of training
set can also improve the performance of the algorithm.


Table 3.6 Effect Comparison Between Caltech Dataset and Multi-dataset

Training data | Input resolution
Caltech only | 640x480
Multi-dataset (Caltech+ETH+TUD) | 640x480


3.7 Conclusion and Future Work 


Existing algorithms still do not solve the problem of pedestrian multi-scale and 
occlusion well, and cannot simultaneously achieve fast and high-precision pedestrian 
detection in real-world scenarios. Our research focuses on solving the above 
bottleneck problems to realize the practical application of pedestrian detection 
algorithms. 

In this paper, we have designed a new fully convolutional neural network for pedestrian
detection. We fully consider the characteristics of each network layer, using low-level
feature maps to detect small-scale pedestrians and high-level feature maps to detect
large-scale pedestrians. This strategy improves the detection accuracy for multi-scale
pedestrians and effectively addresses the occlusion problem. In network training, we
proposed using aspect-ratio statistics of pedestrians to constrain the prediction box.
This approach not only reduces the complexity of the prediction but also significantly
reduces MR. To further improve the overall performance of our algorithm, we optimized the
SSD loss function to remove the effect of the classifier on pedestrian detection.
Experiments show that our algorithm achieves real-time, high-precision pedestrian
detection that can be directly used in practical applications. At the same time, it
accurately labels the positions of dynamic pedestrian areas. This effectively helps the
stereo matching algorithm avoid the influence of dynamic noise, which significantly
improves the quality of depth estimation and accelerates computation for our proposed
multi-view based 3D reconstruction.

In the future, we hope to further improve our pedestrian detection algorithm by proposing
a non-rectangular prediction frame that more closely matches the shape of the pedestrian,
so that it can adapt to harsh environments.


Chapter 4
Depth Estimation as Core Tech


4.1 Introduction 


Stereo matching, or multiple-view matching, has always been one of the most important
research problems in computer vision because disparity (depth) estimation plays a crucial
role in most computer vision applications, including depth-of-field rendering, consistent
object segmentation, and multiple-view stereo (MVS). However, none of the current global
or local stereo matching algorithms [HBGR09, RHB11, ROD10] provides sufficient matching
accuracy and computational efficiency, which prevents numerous stereo matching-based
applications from achieving their desired performance. Most of the current stereo
matching-based multiple-view reconstruction methods [TSF11, LCDX09, BBH08] suffer from
low accuracy, incomplete models, and long running times. Although many algorithms are
being developed to balance precision and speed, several challenges remain. Therefore, we
propose a hybrid tree-guided PatchMatch and quantizing acceleration algorithm to obtain a
dense and accurate disparity map at high speed. First, an initial disparity map is
generated by hybrid tree cost aggregation to constrain the label searching range. Second,
we provide a novel plane refinement strategy for PatchMatch stereo to accurately
calculate the final disparity. Third, we present an effective quantizing acceleration
strategy for the matching cost computation of each continuous disparity. The main
contribution of this paper is the seamless merging of two independent algorithms to
efficiently address the problem of large label spaces while maintaining or even improving
the solution quality. Experimental results show that our proposed algorithm generates
high-quality depth images and achieves better efficiency than the two original
independent methods on the Middlebury and KITTI benchmarks. In addition, the algorithm is
suitable for multi-view reconstruction of real scenes.


4.2 Related Work


In general, stereo matching involves four main steps [SS02]: matching cost computation,
cost aggregation, disparity estimation, and (optionally) disparity refinement. Matching
cost computation is the most intuitive step: based on the difference between pixels, it
produces a cost volume (disparity space image, DSI). Cost aggregation is then viewed as
filtering over the cost volume, which plays a key role in denoising and refining the DSI.
The quality of the cost aggregation method has a significant impact on the success of
stereo matching algorithms. Finally, the disparities are computed with a local or global
optimizer.

Recently, Yang et al. [Yan14] proposed a novel bilateral filtering method for cost
aggregation to estimate depth information, which is very effective for high-quality local
stereo matching. Because the cost is aggregated within a local support window, the local
minimum problem cannot be avoided. Yang et al. [Yan12, Yan15] addressed this problem with
a series of linear-complexity non-local cost aggregation methods, which extend the
support window to the whole image and construct a minimum spanning tree (MST). By taking
the influence of all pixels in the image into account, this method considerably improves
computation speed and accuracy. In addition, Vu et al. [VCY14] extended the above
MST-based cost aggregation method and proposed a hybrid tree algorithm that estimates
depth information using pixel- and region-level MSTs, which avoids MST construction
errors in texture-rich regions and improves the accuracy of MST-based cost aggregation.

In contrast to the preceding stereo correspondence algorithms, which work at one-pixel
precision, patch match is designed to find the similarity between two patches, which is
well suited to stereo matching and achieves depth estimation with sub-pixel accuracy.
Connelly Barnes et al. [BSFD09] first proposed the patch match theory for image
processing. They put forward the concept of a patch, which treats a group of spatial
pixels as a small planar region. The method quickly generates the disparity map by
iteratively guessing the nearest-neighbor similar patch between the left and right view
images. As patch match technology has matured, it has been widely applied to stereo
matching during this decade. For example, the recently proposed PatchMatch stereo
[BRR11a] can directly estimate highly slanted planes and recover notable disparity
details. The structure of this algorithm is shown in Figure 4.1 and is similar to a part
of our proposed stereo matching method.

Figure 4.1 Different Steps of The Patch Match Stereo Algorithm
Notes:
(a) Left and right disparity maps at an intermediate step of the first iteration. Three types of
propagation (spatial, view, and temporal) are illustrated by arrows.
(b) Results after iteration 3.
(c) Results after post-processing.


Their method produced impressive sub-pixel results and ranks highly in the Middlebury
benchmark [SS02]. Given its high quality, this kind of patch-based method has also been
extended with different considerations and requirements. Philipp Heise [HKJK13] developed
a method that integrates the PatchMatch stereo algorithm into a variational smoothing
formulation using quadratic relaxation, which makes it possible to control the smoothness
of the first- and second-order derivatives of the disparity values. Their experiments
showed that the method estimates sub-pixel accurate disparity maps. Shibiao Xu [XZH15]
proposed a novel method for joint stereo matching and object segmentation that uses a
convex formulation of the multi-label Potts model with PatchMatch stereo techniques to
generate a depth map for each image at the object level. Their experiments showed that
the method outperforms the traditional integer-valued disparity strategy as well as the
original PatchMatch algorithm and its variants in sub-pixel accurate disparity
estimation. There are many more PatchMatch-based stereo matching methods [TMN14, LYMD13,
BRK11] with different considerations and requirements.

In Table 4.1, we provide an objective evaluation of these methods on the Middlebury
benchmark at sub-pixel accuracy. Currently, more and more researchers are focusing on
multi-view depth estimation using the PatchMatch method. Shen [She12] utilized the
precise PatchMatch stereo to compute individual depth maps, achieving a depth-map
merging-based MVS reconstruction for large-scale scenes. Shen [She13] extended this work
and proposed a PatchMatch-based MVS method for large-scale scenes, in which the
PatchMatch process generates a depth map for each image with acceptable errors and
consistency is then enforced over neighboring views to refine the disparity values. The
experimental results showed that multiple-view reconstruction using PatchMatch stereo is
significantly more accurate than other methods. Besse [BRFK14] proposed a PatchMatch
belief propagation stereo matching algorithm, which improved the estimation accuracy of
PatchMatch with a global optimization strategy and achieved MVS matching in a 2-D flow
field. However, these strategies often involve a large label space for continuous
disparity estimation, which makes it difficult to balance computational efficiency and
estimation accuracy.


Table 4.1 Evaluation for PatchMatch-based Stereo Matching Methods


Methods evaluated: GC+LSL [TMN14], PM-PM [XZH15], PM-Huber [HKJK13], PMF [LYMD13],
PMBP [BRFK14], PatchMatch [BRR11a], ObjectStereo [BRK11]

Notes: The objective evaluation is based on Middlebury benchmark in sub-pixel accuracy.


4.3 Differences and Contributions

In this paper, we extend our preliminary work [XZD15] by applying hybrid tree-guided
PatchMatch stereo matching and quantizing acceleration within a continuous disparity
estimation framework. The proposed disparity acquisition algorithm is based on the
recently popular hybrid tree cost aggregation algorithm [VCY14] and the accurate
PatchMatch stereo algorithm [BRR11a]. First, an initial disparity map is generated by
hybrid tree cost aggregation to constrain the label searching range. Second, we provide a
novel plane refinement strategy for PatchMatch stereo to accurately calculate the final
disparity. Third, we present an effective quantizing acceleration strategy for the
matching cost computation of each continuous disparity. It is worth noting that other
advanced stereo matching methods (not limited to those we use) at the one-pixel and
sub-pixel levels can also be integrated into our framework; here, one-pixel and sub-pixel
denote the precision of the disparity, corresponding to integer-valued and float-valued
disparities, respectively. In other words, the two independent algorithms can be
seamlessly merged to efficiently address the problem of large label spaces while
maintaining or even improving the solution quality.

Our main differences and contributions include:

- Two independent algorithms (hybrid tree cost aggregation and PatchMatch stereo) are
seamlessly merged to achieve MVS matching while maintaining or even improving the
solution quality.

- An initial one-pixel level disparity map is generated by hybrid tree cost aggregation
to constrain the label searching range of the PatchMatch stereo algorithm. The map not
only significantly accelerates PatchMatch estimation but also improves the disparity from
one-pixel to sub-pixel accuracy.

- Instead of stepping through a large label space for similarity computation, an
effective quantizing acceleration strategy is proposed that linearly interpolates the
matching cost between the two closest disparity values, yielding high efficiency in cost
computation.

- Experiments on the Middlebury and KITTI benchmarks show that our algorithm not only
provides faster and more accurate, or at least competitively accurate, disparity results
than both original stereo algorithms [VCY14, BRR11a] but is also suitable for
multiple-view based 3D reconstruction.
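The interpolation idea in the third contribution can be sketched as follows. This is a minimal illustration under the assumption that matching costs have already been computed at integer disparities 0, 1, 2, ... for one pixel; the function name `interpolated_cost` and the clamping at the ends of the range are choices made for this example only.

```python
import math


def interpolated_cost(costs, d):
    """Linearly interpolate the matching cost at a continuous disparity d.

    costs[k] holds the precomputed cost at integer disparity k.
    """
    if len(costs) < 2:
        raise ValueError("need costs for at least two integer disparities")
    # Clamp d to the range covered by the precomputed costs (an assumption).
    d = min(max(float(d), 0.0), len(costs) - 1.0)
    lo = min(math.floor(d), len(costs) - 2)  # lower integer neighbour
    w = d - lo                               # weight of the upper neighbour, in [0, 1]
    return (1.0 - w) * costs[lo] + w * costs[lo + 1]


# Hypothetical costs at disparities 0, 1 and 2 for a single pixel.
print(interpolated_cost([4.0, 2.0, 6.0], 1.25))
```

With this scheme, evaluating a float-valued disparity needs only two table look-ups instead of a fresh patch comparison, which is where the saving in cost computation comes from.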


4.4 Proposed Method 


4.4.1 Overview 

In this section, we will further explain our method for disparity information 
acquisition from calibrated stereo image pair or multiple views. Our goal is to obtain 
a dense and high-quality disparity map while considering accuracy and efficiency. 

According to the computation principle of the original PatchMatch stereo algorithm
[BRR11a], an individual 3-D slanted plane at each pixel is used to overcome the bias
toward reconstructing fronto-parallel surfaces, and the search is extended to find an
approximate nearest neighbor among all possible planes. These properties help the method
achieve remarkable disparity details with sub-pixel precision. However, the random
initialization relies on the assumption that at least one good guess is very likely to
exist for each disparity plane in the image, which may be inappropriate for a real scene.
This assumption often fails, especially for middle- or low-resolution images, in which
each 3-D plane contains only a small number of pixels and therefore receives too few
guesses. Furthermore, calculating the final disparity of each pixel by iterative search
from a random initialization is time consuming.

In the framework of our proposed algorithm, shown in Figure 4.2, pixels from the stereo
image pair are grouped into superpixel regions. An initial pixel-level disparity is then
generated by the recent hybrid tree cost aggregation algorithm [VCY14]. Next, a novel
iterative plane refinement strategy inspired by the PatchMatch algorithm [BRR11a] is
applied, in which the label searching range and the normal vector are both constrained by
the initial disparity values, to calculate the final sub-pixel level disparity. During
the PatchMatch computation, a quantizing acceleration strategy for continuous disparity
cost computation is also proposed, which significantly improves efficiency by reducing
the number of labels in sub-pixel level disparity estimation. The core processes are
summarized in Algorithm 2 and explained in the following subsections.


Figure 4.2 Overall Framework of Our Stereo Matching Method


Algorithm 2. Quantizing Acceleration Strategy for PatchMatch


1: Decompose input image pair into compact superpixels.
2: ▷ cost computation and aggregation via tree filter:
3: for each superpixel region $S$ do
4:     for $p \in S$ ($d$ is the corresponding disparity) do
5:         $p_d = p - (d, 0)$
6:         $\rho_{color}(p, p_d) = \min(\|I_0(p) - I_1(p_d)\|, T_1)$
7:         $\rho_{grad}(p, p_d) = \min(\|\nabla I_0(p) - \nabla I_1(p_d)\|, T_2)$
8:         $C_d(p) = (1 - \alpha) \rho_{color}(p, p_d) + \alpha \rho_{grad}(p, p_d)$
9:     end for
10:    $C_d(S) = \sum_{p \in S} C_d(p)$
11: end for
12: ▷ adaptive fusion of both level aggregated costs:
13: for each superpixel region $S$ do
14:    $C_d^{F}(S) = \alpha_S C_d^{A}(p) + (1 - \alpha_S) C_d^{A}(S)$
15: end for
16: ▷ reliable label searching range generation:
17: for each pixel $p$ ($d$ is the initial disparity) do
18:    $d_{min} = d - 0.5 - eps$; $d_{max} = d + 0.5 - eps$; $n' = n + \Delta z$
19: end for
20: ▷ PatchMatch refinement and quantizing acceleration:
21: for each pixel $p$ do
22:    $\rho_d = (\lfloor d \rfloor + 1 - d) \rho_{\lfloor d \rfloor} + (d - \lfloor d \rfloor) \rho_{\lfloor d \rfloor + 1}$
23:    $f_p = \arg\min_{f \in \mathcal{F}} m(p, f)$
24: end for
25: Output final disparity in sub-pixel accuracy.


4.4.2 Initial Integer-valued Disparity Set from Hybrid Tree

Given a pair of images $I_0$ and $I_1$, we denote by $I_0(p)$ the intensity of pixel $p$
in image $I_0$ and by $I_1(p_d)$ the intensity of pixel $p_d$ in image $I_1$. For stereo
matching, $p_d$ can be easily computed as $p_d = p - (d, 0)$, with $d$ being the
corresponding disparity value. For multiple-view matching, $p_d$ is represented as
$p_d = H \cdot [p_x, p_y, d]^T$, where the homography $H$ between images $I_0$ and $I_1$
is computed from their intrinsic and extrinsic camera parameters. The computation of the
homography is detailed below.

Given a set of images from multiple views, a suitable reference image must be selected
from the input set to form a stereo pair for disparity estimation. We use a method
similar to [LLC10] to select eligible stereo pairs, so that the reference image has a
viewing direction similar to that of the target image and the baseline is neither
extremely short nor extremely long. Assume that the camera parameters of the image pair
are $\{K_i, R_i, T_i\}$ and $\{K_j, R_j, T_j\}$, where $K_i$ and $K_j$ are the
camera-intrinsic parameters [Zha00], and $R$ and $T$ are the camera rotation and
translation parameters relative to the world coordinate system, respectively. Then, the
resulting induced homography is:

$$H_{ij} = K_j \left[ R_j R_i^{-1} + \left( T_j - R_j R_i^{-1} T_i \right) \right] K_i^{-1} \quad (4.1)$$


We calculate the pixel- and region-level dissimilarity costs for the hybrid tree cost
aggregation [VCY14]. A 3-D cost volume [Zha00] is defined as $D(x, y, d)$, where $x$ and
$y$ are the current pixel coordinates and $d$ is the user-given disparity level; it
stores the matching costs between the input images at each discrete integer-valued
disparity level. The pixel-level matching cost $C_d(p)$ is defined as the dissimilarity
between pixel $p$ and pixel $p_d$, given by a linear combination of the color
dissimilarity and the gradient difference:

$$C_d(p) = (1 - \alpha) \min\left( \|I_0(p) - I_1(p_d)\|, T_1 \right) + \alpha \min\left( \|\nabla I_0(p) - \nabla I_1(p_d)\|, T_2 \right) \quad (4.2)$$

where $\alpha$ balances the color and gradient terms, and $T_1$ and $T_2$ are the
truncation values. In addition, we use the SLIC (Simple Linear Iterative Clustering)
algorithm [ASS12] to over-segment the input images and assign each pixel to its
corresponding superpixel region $S$. The region-level cost
$C_d(S) = \sum_{p \in S} C_d(p)$ under each disparity level is then defined as the sum of
all pixel-level matching costs in the same superpixel region $S$.
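The pixel-level and region-level costs can be illustrated with a small sketch for grayscale images stored as nested lists. The helper names, the forward-difference gradient, the treatment of pixels whose match falls outside the second image, and the default values of `alpha`, `t1` and `t2` are all assumptions made for this example; the superpixel labels are taken as given rather than computed with SLIC.

```python
from collections import defaultdict


def horizontal_gradient(img):
    """Forward difference along x; the last column gets gradient 0."""
    return [[row[x + 1] - row[x] if x + 1 < len(row) else 0.0
             for x in range(len(row))] for row in img]


def pixel_cost(i0, i1, g0, g1, x, y, d, alpha=0.9, t1=10.0, t2=2.0):
    """Truncated intensity and gradient dissimilarity between p and p_d."""
    xd = x - d  # p_d = p - (d, 0)
    if xd < 0 or xd >= len(i1[y]):
        # Assumption: a match outside the image is charged the maximum cost.
        return (1.0 - alpha) * t1 + alpha * t2
    color = min(abs(i0[y][x] - i1[y][xd]), t1)
    grad = min(abs(g0[y][x] - g1[y][xd]), t2)
    return (1.0 - alpha) * color + alpha * grad


def region_costs(i0, i1, labels, d, **params):
    """Sum the pixel-level costs at disparity d over each superpixel label."""
    g0, g1 = horizontal_gradient(i0), horizontal_gradient(i1)
    sums = defaultdict(float)
    for y, row in enumerate(labels):
        for x, s in enumerate(row):
            sums[s] += pixel_cost(i0, i1, g0, g1, x, y, d, **params)
    return dict(sums)


# One-row toy pair in which the second image is the first shifted by one pixel.
left = [[10, 20, 30, 40, 50]]
right = [[20, 30, 40, 50, 50]]
print(region_costs(left, right, [[0, 0, 0, 1, 1]], d=1))
```

Repeating `region_costs` for every integer disparity level would fill the region-level counterpart of the cost volume $D(x, y, d)$.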

The tree filtering-based cost aggregation [Yan15] is also necessary to reduce noise in
pixel-based multiple-view matching when searching for the lowest cost. After calculating
the matching cost for each pixel, we apply cost aggregation on the hybrid tree to achieve
an adaptive fusion of the pixel- and region-level aggregated costs,
$C_d^{F}(S) = \alpha_S C_d^{A}(p) + (1 - \alpha_S) C_d^{A}(S)$, where the parameter
$\alpha_S$ is the edge density of the region $S$ (for more details on the adaptive fusion
of the two levels of cost, please refer to [VCY14]). Using the winner-take-all (WTA)
method [SS02], we select the minimum-cost disparity for each pixel; the resulting
integer-valued disparity set $D_{init}$ is then applied as a range constraint to improve
the refinement quality of the PatchMatch algorithm.
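A minimal sketch of the fusion and winner-take-all steps for a single pixel is given below. It assumes that the aggregated pixel-level and region-level costs are already available as lists indexed by integer disparity; the tree-based aggregation itself is not reproduced, and the function name and sample numbers are hypothetical.

```python
def fuse_and_select(pixel_agg, region_agg, alpha_s):
    """Blend the two aggregated costs and pick the minimum-cost disparity.

    pixel_agg[d] and region_agg[d] are the aggregated costs at integer
    disparity d for one pixel and for the superpixel that contains it.
    """
    if len(pixel_agg) != len(region_agg):
        raise ValueError("cost lists must cover the same disparity levels")
    fused = [alpha_s * cp + (1.0 - alpha_s) * cs
             for cp, cs in zip(pixel_agg, region_agg)]
    # Winner-take-all: the disparity with the lowest fused cost.
    best_d = min(range(len(fused)), key=lambda d: fused[d])
    return best_d, fused


# With equal weighting the region-level evidence moves the winner to d = 2.
print(fuse_and_select([3.0, 1.0, 2.0], [1.0, 3.0, 0.5], alpha_s=0.5))
```

A larger `alpha_s` gives more weight to the pixel-level cost, so regions with a high edge density rely less on the region-level evidence.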

4.4.3. The Theory of Patch Matching 

The purpose of patch matching is to find, for every overlapping patch in image A (target
image), the nearest-neighbor patch in image B (source image). In other words, patch
matching is a randomized similarity search algorithm, in which a patch is a square region
centered on the current pixel, and the search repeatedly looks, for each square in one
image, for the most similar square in the other image. The patch matching algorithm can
be summarized in three procedures: initialization, propagation, and random search, as
shown in Figure 4.3. We describe each of them below.

(1) Initialization 

Before searching for nearest-neighbor patches between images A and B, the patch match
algorithm first assigns random values or prior information as the initial offset guesses
for the later search (the random offsets are independent uniform samples across the full
range of image B). The key insight that motivates the algorithm is that it searches in
the space of possible offsets and improves them cooperatively through searches of
adjacent offsets, and that even a random offset may be a good guess for many patches over
a large image. Patch match defines the nearest-neighbor field (NNF) as a function
$f: A \rightarrow \mathbb{R}^2$ of offsets, defined over all possible patch coordinates
(locations of patch centers) in image A. Given a patch coordinate $a$ in image A and its
corresponding nearest-neighbor patch coordinate $b$ in image B, $f(a) = b - a = v$ is the
offset between the two patches.
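The random initialization of the NNF can be sketched as follows. The dictionary representation, the `(width, height)` size tuples and the function name `init_nnf` are assumptions made for this illustration.

```python
import random


def init_nnf(size_a, size_b, rng):
    """Return {a: v} with a random offset v such that a + v lies inside image B."""
    (wa, ha), (wb, hb) = size_a, size_b
    nnf = {}
    for y in range(ha):
        for x in range(wa):
            # Independent uniform sample over the full range of image B.
            bx, by = rng.randrange(wb), rng.randrange(hb)
            nnf[(x, y)] = (bx - x, by - y)  # f(a) = b - a
    return nnf


nnf = init_nnf((4, 3), (6, 5), random.Random(0))
print(len(nnf))
```

When prior information is available, the random sample for `(bx, by)` would simply be replaced by the prior guess.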


(a) Initialization (b) Propagation (c) Search 


Figure 4.3 Steps of The PatchMatch Algorithm 
Notes: 
(a) Patches initially have random assignments; 
(b) The blue patch checks above/green and left/red neighbors to see if they will improve the 
blue mapping, propagating good matches; 


(c) The patch searches randomly for improvements in concentric neighborhoods. 


(2) Propagation 

After initialization, the patch match algorithm runs an iterative update process on the
NNF, in which good patch offsets are propagated to adjacent pixels, followed by a random
search in the neighborhood of the best offset found so far. Propagation improves the
value of $f(x, y)$ by using the known offsets $f(x-1, y)$ and $f(x, y-1)$. The offsets
are examined in scan order. If the patch $a$ at location $(x, y)$ in image A currently
has offset $f(x, y)$ and corresponds to patch $b$ in image B, the algorithm checks
whether the offsets of the left and upper neighbors give a better match for patch $a$
(see Figure 4.3 (b)). The matching degree is evaluated by the patch error function
$D(v)$, and the updated value of $f(x, y)$ is the argmin of
$\{D[f(x, y)], D[f(x-1, y)], D[f(x, y-1)]\}$.
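The propagation step can be sketched as below. The patch error is passed in as a caller-supplied function `dist(x, y, v)`, which is an assumption of this example, as are the dictionary-based NNF and the function names.

```python
def propagate(nnf, x, y, dist):
    """Update nnf[(x, y)] with the best of its own, left and upper offsets.

    dist(x, y, v) is a caller-supplied patch error for offset v at (x, y).
    """
    candidates = [nnf[(x, y)]]
    for neighbour in ((x - 1, y), (x, y - 1)):
        if neighbour in nnf:
            candidates.append(nnf[neighbour])
    nnf[(x, y)] = min(candidates, key=lambda v: dist(x, y, v))
    return nnf[(x, y)]


def propagation_pass(nnf, width, height, dist):
    """Visit all patch centres in scan order (left to right, top to bottom)."""
    for y in range(height):
        for x in range(width):
            propagate(nnf, x, y, dist)
    return nnf


# Toy error: every patch is best matched by the offset (2, 0).
def toy_error(x, y, v):
    return abs(v[0] - 2) + abs(v[1])


field = {(x, y): (5, 5) for y in range(2) for x in range(3)}
field[(0, 0)] = (2, 0)  # a single good guess
print(propagation_pass(field, 3, 2, toy_error))
```

In the toy run a single good offset at the top-left corner spreads to the whole field in one pass, which is the behaviour that makes a handful of lucky initial guesses sufficient.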

(3) Random Search 

If the iterative update used propagation alone, it would become trapped in local 
minima, which prevents it from finding the true nearest-neighbor patch. Therefore, in 
each iteration, every propagation step is followed by a random search step. Random 
search attempts to further improve the value of $f(x, y) = v_0$ by testing a sequence 
of candidate offsets at exponentially decreasing distances from $v_0$ (see Figure 4.3 
(c)). The candidate offsets are generated as $v_i = v_0 + w\alpha^i R_i$, where $R_i$ 
is a uniform random vector in [-1, 1] x [-1, 1], $w$ is a large maximum search radius, 
and $\alpha$ is a fixed ratio between search window sizes. The candidates are examined 
for i = 0, 1, 2, ... until the current search radius $w\alpha^i$ falls below one 
pixel. Whenever a candidate offset improves on the current patch offset, it replaces 
the old one. 
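As a rough sketch of the random search step, the following function samples candidates around $v_0$ with radius $w\alpha^i$ and keeps the best one. The function name, the `error` callback, and the default `alpha = 0.5` are illustrative assumptions, not values taken from this work; offsets are kept as floats for simplicity.

```python
import random
from typing import Callable, Optional, Tuple

Offset = Tuple[float, float]


def random_search(v0: Offset, error: Callable[[Offset], float],
                  w: float, alpha: float = 0.5,
                  rng: Optional[random.Random] = None) -> Offset:
    """Test candidates v_i = v0 + w * alpha**i * R_i until the radius < 1 pixel."""
    rng = rng or random.Random()
    best, best_err = v0, error(v0)
    radius = w
    while radius >= 1.0:
        # R_i is uniform in [-1, 1] x [-1, 1]
        r = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        candidate = (v0[0] + radius * r[0], v0[1] + radius * r[1])
        e = error(candidate)
        if e < best_err:           # keep the candidate only if it improves
            best, best_err = candidate, e
        radius *= alpha            # exponentially shrinking search window
    return best
```

With `w = 8` and `alpha = 0.5` the radii are 8, 4, 2 and 1, so four candidates are tested in addition to the current offset.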


4.4.4 PatchMatch with Disparity and Normal Constraints 

Although we have obtained the optimal initial disparity value set, these values cannot 
be treated as the final results. All of the preceding processing aims to obtain a 
guessing range for the labels (disparity and normal) that serves the global PatchMatch 
label refinement method explained in this section. The goal of this part is to refine 
the disparity results and further reduce the matching cost. 

Based on the optimal initial disparity value $d$ of each pixel, we first compute the 
reliable disparity search range used for matching all pixels, which consists of $d$, 
$d_{min}$, $d_{max}$, and three other random disparity values drawn from the interval 
$[d_{min}, d_{max}]$ for each pixel. This is accomplished by setting the minimum 
allowed disparity $d_{min} = d - 0.5$ and the maximum allowed disparity 
$d_{max} = d + 0.5 - eps$, where eps is a very small value. Furthermore, we compute 
the reliable normal search range from the current normal vector $n$ defined on the 
initial disparity values of the pixels, with the allowed rotation $\Delta n$ 
constrained to [-30°, 30°] (this angle range follows the depth-map computation work of 
[HC04]). We then generate different random normals $n' := n + \Delta n$ for the 
preceding reliable disparity values. Finally, we run the global PatchMatch algorithm 
to refine disparity and normal as shown below. 

For each pixel p of the image pair, we seek a plane $f_p$ that has the minimum 
aggregated matching cost among all possible planes in the reliable range: 

$$f_p = \arg\min_{f \in F} m(p, f) \qquad (4.3)$$

where F denotes the candidate set of all planes for each pixel p and is defined by 
the preceding reliable disparity and normal values. We use the α-expansion algorithm 
[HC04] for the whole label optimization. 

In the traditional PatchMatch stereo algorithm [BRR11a], the aggregated pixel-matching 
cost for a plane f is calculated as 

$$m(p, f) = \sum_{q \in W_p} \omega(p, q)\,\rho(q, q - d_q) \qquad (4.4)$$

where the sum runs over the pixels q of the matching window $W_p$ around p, and 
$\omega(p, q) = e^{-\|I(p) - I(q)\|/\gamma}$ is an adaptive weight function that is 
computed from the color difference between pixels p and q and is used to overcome the 
edge-fattening problem. The dissimilarity function $\rho(p, q)$ is defined similarly 
to Equation (A.1). For any given plane f, the corresponding disparity of the current 
pixel can be calculated with sub-pixel precision as 

$$d_p = a_f p_x + b_f p_y + c_f \qquad (4.5)$$

where $a_f$, $b_f$ and $c_f$ are the three parameters of plane f, and $p_x$ and 
$p_y$ denote the coordinates of pixel p. Notably, this continuous label space prevents 
rapid checking of all possible labels in the label refinement process, because a 
brute-force approach would have to compute the matching cost for every continuous 
disparity. In the next section, we provide an effective quantizing acceleration 
strategy for the matching cost computation of continuous disparities. 
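To make the label construction of this section concrete, the sketch below generates the six candidate disparities of a pixel and evaluates Equation (4.5) for a plane. The helper names and the value of `EPS` are assumptions for illustration. The conversion from a normal vector to the plane parameters (a, b, c) uses the standard point-normal form, which the text does not spell out, and the random perturbation of the normal within the 30° limit is omitted.

```python
import random
from typing import List, Tuple

EPS = 1e-6  # assumed stand-in for the "very small value" eps in the text


def candidate_disparities(d: float, rng: random.Random,
                          n_random: int = 3) -> List[float]:
    """Return d, d_min, d_max and n_random samples from [d_min, d_max]."""
    d_min = d - 0.5
    d_max = d + 0.5 - EPS
    extra = [rng.uniform(d_min, d_max) for _ in range(n_random)]
    return [d, d_min, d_max] + extra


def plane_from_normal(n: Tuple[float, float, float],
                      x0: float, y0: float, d0: float) -> Tuple[float, float, float]:
    """Plane through (x0, y0, d0) with normal n, as (a, b, c); requires n_z != 0."""
    nx, ny, nz = n
    a = -nx / nz
    b = -ny / nz
    c = (nx * x0 + ny * y0 + nz * d0) / nz
    return a, b, c


def plane_disparity(plane: Tuple[float, float, float],
                    px: float, py: float) -> float:
    """Equation (4.5): d_p = a_f * p_x + b_f * p_y + c_f."""
    a, b, c = plane
    return a * px + b * py + c
```

A fronto-parallel normal (0, 0, 1) yields a = b = 0 and c = d0, so the plane reduces to a constant disparity, which is the integer-label case of standard stereo matching.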


4.4.5 Quantizing Acceleration for Continuous Disparity Cost 

Labels often correspond to integer-valued disparities in standard stereo matching. 
Therefore, WTA can easily be used to find the true minimum cost and obtain the optimal 
initial integer-valued disparity for each pixel. However, the number of labels becomes 
infinite for disparity estimation with sub-pixel accuracy, as in PatchMatch stereo. 
Stepping through such a large label space for continuous disparities is difficult and 
entails numerous matching cost calculations in each label optimization pass. 

Based on the preceding analysis, we discretize the continuous disparity into a number 
of values and compute a linear filter for each value. The final output is then a 
linear interpolation of the matching cost between the two closest disparity values. In 
practice, the disparity value for each pixel is discretized as 
$D = \{1, 2, \ldots, L_k, L_{k+1}, \ldots, L_N\}$, where $L_N$ is the maximum 
disparity value. Given the disparity $d \in [L_k, L_{k+1}]$ of pixel p, the 
dissimilarity function $\rho(p, p - d)$ in Equation (A.3) can be expressed as: 

$$\rho(p, p - d) = (L_{k+1} - d)\,\rho(p, p - L_k) + (d - L_k)\,\rho(p, p - L_{k+1}) \qquad (4.6)$$

Instead of directly computing the matching cost for each continuous disparity d, 
quantization is easy to implement and improves computing performance (5 to 6 times 
faster on average) with the same output accuracy, because the cost volume on the 
discretized disparity set D needs to be pre-computed only once and can then be looked 
up in constant time. 
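Equation (4.6) amounts to a table lookup followed by a linear blend. The following sketch assumes unit spacing between quantization levels and, for simplicity, a cost list indexed from 0 for a single pixel; the function name and data layout are illustrative only.

```python
import math
from typing import Sequence


def interpolated_cost(cost_at_level: Sequence[float], d: float) -> float:
    """Blend the precomputed costs of the two integer levels enclosing d.

    cost_at_level[k] holds the precomputed dissimilarity at integer
    disparity k (assumed layout); d must lie in [0, len(cost_at_level) - 1].
    """
    lo = int(math.floor(d))
    hi = min(lo + 1, len(cost_at_level) - 1)
    if hi == lo:                       # d sits on the last level
        return cost_at_level[lo]
    t = d - lo                         # equals d - L_k; 1 - t equals L_{k+1} - d
    return (1.0 - t) * cost_at_level[lo] + t * cost_at_level[hi]
```

Because the per-level costs are computed once, every continuous label tested during refinement costs only two lookups and one blend.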




4.5 Experimental Results and Evaluation 


All the experiments are implemented on a PC platform with an Intel i7 3.60 GHz 
CPU, 16 GB memory, and an NVIDIA GeForce GTX 980 GPU. We use the same parameters 
as the original PatchMatch stereo matching algorithm [BRR11a] and the hybrid tree 
algorithm [VCY14], namely $\{\gamma, \alpha, \tau_{col}, \tau_{grad}\} = \{10, 0.9, 10, 2\}$, 
with a large patch size of 51 pixels. 

Post-refinement is often used to conceal disparity estimation errors after the stereo 
matching computation. To reflect the real performance of each algorithm, we tested our 
algorithm (without any post-refinement of the disparity) against the two stereo 
matching methods used in our work on stereo pairs from the second and third Middlebury 
benchmarks [SS02, SHK14] and the KITTI dataset [GLU12]. We compare the disparity 
results of each stereo matching method with the corresponding ground-truth disparity 
values provided by the benchmark under a given threshold. If the estimation deviation 
of a pixel is greater than the threshold, the pixel is considered an error pixel and 
marked in red (see Figure 4.4 and Figure 4.5). 


Figure 4.4 Visual Comparison on Middlebury Benchmark 

Notes: Visual comparison with other methods for disparity results on the Middlebury 
benchmark. From left to right: (a) original stereo image pair, (b) Hybrid Tree results 
[VCY14], (c) PatchMatch results [BRR11a], and (d) results of our method. 


Figure 4.5 Visual Comparison on KITTI Benchmark 

Notes: Visual comparison with other methods for disparity results on the KITTI 
benchmark (the error threshold is three pixels). From left to right: (a) original 
stereo image pair, (b) Hybrid Tree results [VCY14] (error rate: 14.58%, 17.58%), 
(c) PatchMatch results [BRR11a] (error rate: 9.21%, 17.38%), and (d) results of our 
method (error rate: 9.36%, 12.26%). 


Marking the error pixels in this way makes the quality of the disparity estimation 
directly visible. The error rate is defined as $\sum_p N_{error}(p) / N_{total}$, 
where $\sum_p N_{error}(p)$ is the total number of error pixels and $N_{total}$ is the 
total number of pixels in the image. The quantitative evaluation results are 
summarized in Table 4.2 and Table 4.3. Our method achieved the best average error 
rates compared with the two other methods [VCY14, BRR11a]. 
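The error rate defined above can be computed as in the following sketch. The function name and the nested-list image layout are assumptions for the example; masking of pixels without valid ground truth, which the benchmarks apply, is left out.

```python
from typing import Sequence


def error_rate(estimated: Sequence[Sequence[float]],
               ground_truth: Sequence[Sequence[float]],
               threshold: float) -> float:
    """Fraction of pixels whose disparity deviates by more than threshold."""
    n_error = 0
    n_total = 0
    for row_est, row_gt in zip(estimated, ground_truth):
        for d_est, d_gt in zip(row_est, row_gt):
            n_total += 1
            if abs(d_est - d_gt) > threshold:   # this is an error pixel
                n_error += 1
    return n_error / n_total
```

A threshold of 0.5 corresponds to a sub-pixel evaluation, while thresholds of 2 or 3 correspond to the two-pixel and three-pixel evaluations reported in this section.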


Table 4.2. Objective Evaluation on 2nd Middlebury Benchmark 

[Table rows: Our Method, PatchMatch [BRR11a], Hybrid Tree [VCY14]] 

Notes: Objective evaluation for the proposed method with the 2nd Middlebury benchmark 
in sub-pixel accuracy. 

In Table 4.2, a sub-pixel accuracy evaluation is provided with the second 
Middlebury benchmark. Notably, “Tsukuba” in the benchmark should be omitted 
because its ground truth is quantized to integer values, which is unsuitable for 
sub-pixel comparisons. Over the three remaining views, the average error rate is 
7.58% for PatchMatch versus 6.83% for our method, which we consider a significant 
improvement over the PatchMatch method. 


Table 4.3. Objective Evaluation on 3rd Middlebury Benchmark 

[Table rows: Adirondack, Jadeplant, Motorcycle, MotorcycleE] 

Notes: Objective evaluation for the proposed method with the 3rd Middlebury benchmark 
in sub-pixel and two-pixel accuracy. Running time of each image pair is also provided. 

In Table 4.3, sub-pixel and two-pixel accuracy evaluations are provided with the third 
Middlebury benchmark, together with the running time of each image pair. Two kinds of 
disparity results are reported (our method with and without cost aggregation) to 
indicate the advantages of cost aggregation. The results show that the proposed method 
merges two independent algorithms (hybrid tree cost aggregation and PatchMatch-based 
label search) to efficiently address the problem of large label spaces while 
maintaining or even improving the solution quality. 

A visual comparison with the two other methods for disparity results on different 
benchmarks is provided in Figure 4.4 and Figure 4.5. All disparity results indicate 
an improvement when using our hybrid tree-guided patch matching. 

We also applied our method in an MVS application to demonstrate the practicability 
of our algorithm. Figure 4.6 and Figure 4.7 show our depth estimation results 
(without any post-refinement of the depth) for images of realistic complex scenes 
(“statue” and “tree”). Reconstructing the statue and the tree realistically is 
difficult because of the non-uniform color distribution of the statue and the inherent 
geometric complexity of trees. The results obtained with the proposed method indicate 
that this approach has good potential for multi-view reconstruction in real scenes. 

(c) Frame 3 for “Statue” (d) Frame 4 for “Statue” 

Figure 4.6 Our Estimated Multi-View Depth Results 1 

Notes: Our estimated multi-view depth results from real scene images (“Statue”). Rows 1 
and 3 are the input images captured from four different viewpoints. Rows 2 and 4 are 
the related depth results. 

(c) Frame 3 for “Tree” (d) Frame 4 for “Tree” 

Figure 4.7 Our Estimated Multi-View Depth Results 2 

Notes: Our estimated multi-view depth results from real scene images (“Tree”). Rows 1 
and 3 are the input images captured from four different viewpoints. Rows 2 and 4 are 
the related depth results. 


4.6 Conclusion and Future Work 


PatchMatch is an excellent tool in the field of computer vision that has been widely 
applied in many kinds of applications, especially depth estimation. Over the last 
decade, researchers have invested considerable effort in improving the PatchMatch 
algorithm, including combining it with new algorithms, using new data structures, and 
running it on high-performance hardware, in order to avoid the drawbacks caused by 
label searching. However, no existing work completely breaks the bottleneck between 
the accuracy and the speed of the PatchMatch algorithm, because higher accuracy 
requires more iterations. In this paper, we presented a hybrid tree-guided PatchMatch 
and quantizing acceleration algorithm for stereo. In our method, hybrid tree cost 
aggregation and PatchMatch label search complement each other. The disparity search 
range starts from the integer-level disparity of the hybrid tree cost aggregation and 
is then refined by a PatchMatch optimizer to enhance the accuracy. The experiments 
show that our approach is more robust, more accurate and faster in disparity 
estimation than the two other comparison methods. 

For future enhancement, we can integrate our algorithm into more global 
PatchMatch methods, including PMBP [BRFK14] and PM-Huber [HKJK13]. This 
method mainly recovers dense depth images from multi-view image pairs. 
However, we only consider color consistency in the pixel matching cost 
computation. In the future, we will add geometric consistency among pixels to the 
cost function to improve the details of the dense depth images. In addition, in the 
three-dimensional reconstruction of video, our key frames are currently selected 
manually, which more or less affects the accuracy of depth estimation. Moreover, 
because there is no bundle optimization for the depth images between key frames, the 
depth video tends to flicker. These are the problems that we will address in future 
work based on the proposed method. 


5.1 Introduction 


With the continuous development of 3D reconstruction technology, the requirements 
for reconstruction accuracy have become higher. In order to better understand 
three-dimensional space, various depth sensors have been designed to acquire depth 
information automatically, such as laser scanners, TOF cameras and passive stereo 
camera systems. Unfortunately, the resolution of existing depth sensors is usually 
extremely low, so they cannot directly provide high-quality 3D reconstruction. For 
example, the current advanced 3D-TOF camera, the Swiss Ranger, can only generate 
depth maps with a resolution of 100x100. In contrast, the resolution of an ordinary 
home camera can now be higher than 1000x1000. Microsoft's Kinect is currently the 
only commercially available depth sensor, and it can only produce depth images at 
640x480. The existing depth sensors are therefore only suitable for home 
entertainment and are still far from meeting the standard of industrial applications. 
Recently, 3D reconstruction techniques have gradually replaced traditional depth 
sensors, and more and more stereo matching algorithms have been proposed to estimate 
depth information for 3D reconstruction. One example is the PatchMatch algorithm 
explained in the previous chapter, which provides depth estimation with sub-pixel 
accuracy. Owing to various environmental influences and hardware quality problems, 
the existing state-of-the-art depth sensors and stereo matching algorithms inevitably 
suffer from loss of depth information and distortion, which seriously affects the 
accuracy of 3D reconstruction. Therefore, post-processing of the obtained depth map is 
very important for a 3D reconstruction algorithm. 

In order to overcome the shortcomings of depth sensors and stereo matching 
techniques, researchers have developed many methods to improve depth map 
resolution and recover missing depth information, such as [CBTT08, LTT13, PKT11, 
YYLH12]. Most existing depth map up-sampling methods follow the assumption that 
adjacent pixels with similar colors are likely to have similar depth values. They 
restore the missing depth mainly by interpolation guided by the registered color 
image. The typical method designs various edge-preserving weights from the registered 
RGB image. These weights are then integrated into the interpolation model to improve 
the quality of the depth map. However, one important issue is easily overlooked: the 
structures of the depth map and of the color image are easily mismatched. Parts of an 
image with similar colors are not necessarily at the same depth, which inevitably 
produces wrong weights and leads to poor up-sampling results, such as edge blur and 
texture copying. We found that the set of edge pixels of a color image typically 
contains the edge pixels of the depth map. When adjacent pixels lie in the same depth 
region but have different colors, existing algorithms generally assume that the two 
pixels do not belong to the same depth and directly remove the connection between 
them. This prevents the depth of the seed pixels from propagating to the interpolated 
pixels and results in erroneous depth interpolation. To address this problem, 
existing methods use the color similarity of the coupled pixels as their weights. A 
method based on strong edge-preserving weights copies the texture of the image onto a 
slanted depth surface while maintaining sharp depth edges. Conversely, a method based 
on weak edge-preserving weights avoids texture copying but blurs the depth edges. 

In this paper, we propose a new strategy that fundamentally solves the problems of 
edge blur and texture copying in depth map interpolation. We first segment the 
depth map into different regions based on the color image, and we interpolate depth 
only from the seeds in the same region. In this way, the edge blur problem is easily 
avoided. In order to completely solve the texture copying problem, we abandon the 
traditional edge-preserving weighting method. We assume that each segment is a 
slanted plane and use this assumption to estimate the missing depth on each 
segment. In our algorithm, the segmentation and interpolation methods complement 
each other within a unified variational formulation to achieve depth map 
interpolation. 


5.2 Related work 


Most depth map up-sampling algorithms focus only on how to interpolate depth 
without considering segmentation. The existing depth map up-sampling algorithms can 
be divided into two categories: global methods and local methods. Global algorithms 
traverse all pixels and assign a larger penalty to coupled pixels that have similar 
colors but different depths. They obtain the optimal solution by minimizing MRF 
[KKC99] or AR [ZW08] models. In contrast, local algorithms only focus on the seed 
pixels within a region, and the interpolated depth is based on a weighted average of 
the depths of these seeds, where the contribution of each seed is proportional to the 
color similarity between the pixel and the seed. There are many excellent depth map 
up-sampling algorithms. For example, Diebel [DT05b] first proposed using the MRF model 
to implement depth map up-sampling. Park [PKT11] further improved the performance of 
the MRF model by designing an NLM edge-preserving term. Subsequently, Yang [YYLH12] 
proposed the AR interpolation model. These works laid the two major research 
directions of global algorithms. In contrast to the global algorithms [DT05b, PKT11, 
YYLH12], Kopf [KCLU07] first proposed the theory of local algorithms and designed a 
filter to up-sample the depth map. In order to improve the performance of local 
algorithms, Chan [CBTT08] proposed a noise-aware filter to solve the problem of 
artifact copying. To improve efficiency, Liu [LTT13] introduced a joint geodesic 
up-sampling filter that enables real-time depth map interpolation and is better suited 
to practical applications. 

The existing image segmentation algorithms can be mainly divided into 
supervised segmentation and unsupervised segmentation. Among them, the Potts 
model [NTC13] is currently the most commonly used supervised segmentation model, 
and it admits many convex relaxations for finding the optimal segmentation. At 
present, unsupervised methods mainly include K-means, mean shift and normalized 
cuts. In this paper, the Potts model is applied to unsupervised segmentation, which 
includes partitioning and the calculation of the depth plane parameters for each 
segment. 

We solve the interpolation process and the segmentation process jointly. While the 
image is segmented, the depth map is further up-sampled. In our algorithm, a 
three-dimensional scene is considered to be composed of multiple planes with plane 
equations $a_i x + b_i y + c_i$. Based on these planes, we first divide the image 
domain $\Omega$ into several parts $\Omega_i$. Finally, the interpolation of the 
depth map is carried out on each part. 


5.3 Proposed Method 


5.3.1 Objective Function 

The core of our algorithm is an objective function consisting of five terms, which 
uses the color image $I$ to interpolate and segment the depth map. Here, we define 
$L$ as the total number of segments. First, we divide the image domain $\Omega$ into 
segments $\Omega_i$. Then, for each segment we estimate the parameters $a_i$, $b_i$, 
$c_i$ of its corresponding plane equation $a_i x + b_i y + c_i$, and finally we 
interpolate the depth $d$ by optimization. 

1. Perimeter $\mathrm{Per}(\Omega_i)$: 

The perimeter term is used to define better segments. We believe that a good 
partition should be compact, which means that the boundary length of each segmented 
area should be as short as possible. Let $u_i = \mathbb{1}_{\Omega_i}$ be the 
indicator function of the segment $\Omega_i$. Then, the perimeter of each segmented 
area can be defined as $\mathrm{Per}(\Omega_i) = \int_\Omega |\nabla u_i|$. We want 
the edge of each segment to follow the strong edges of the color image $I$, which are 
also likely to correspond to edges of the depth map. Therefore, we introduce the 
weighting term $g(x) = \exp(-a|\nabla I(x)|)$ into the perimeter function and keep 
the weighted perimeter $\int_\Omega g|\nabla u_i|$ as small as possible. 
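A discrete version of the weighted perimeter can illustrate why the weight $g$ pulls segment boundaries toward image edges. The sketch below uses forward differences on nested lists; the function names and the discretization are assumptions made for this example and are not the numerical scheme of this work.

```python
import math
from typing import List, Sequence


def gradient_magnitude(img: Sequence[Sequence[float]]) -> List[List[float]]:
    """Forward-difference gradient magnitude, zero across the image border."""
    h, w = len(img), len(img[0])
    mag = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = img[y][x + 1] - img[y][x] if x + 1 < w else 0.0
            gy = img[y + 1][x] - img[y][x] if y + 1 < h else 0.0
            mag[y][x] = math.hypot(gx, gy)
    return mag


def weighted_perimeter(u: Sequence[Sequence[float]],
                       image: Sequence[Sequence[float]], a: float) -> float:
    """Discrete sum of g * |grad u| with g = exp(-a * |grad I|)."""
    grad_u = gradient_magnitude(u)
    grad_i = gradient_magnitude(image)
    total = 0.0
    for y in range(len(u)):
        for x in range(len(u[0])):
            total += math.exp(-a * grad_i[y][x]) * grad_u[y][x]
    return total
```

A segment boundary that coincides with a strong image edge is multiplied by a small $g$ and is therefore cheap, whereas the same boundary placed in a flat image region is charged its full length.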

2. Label cost prior $\sum_{i=1}^{L}\|u_i\|_\infty$: 

Rissanen [Ris78] first proposed the shortest data description principle, which 
states that data should be represented with as few symbols as possible. Inspired by 
this principle, we believe that the total number of segments should be limited 
appropriately to prevent over-segmentation. In our objective function, the label cost 
prior is defined as $\sum_{i=1}^{L}\|u_i\|_\infty$ to control the maximum number of 
partitions. 


3. Residual $r_i(S)$: 

In the objective function, we design the residual term 
$r_i(S) = \int_{\Omega_i} T(|a_i x + b_i y + c_i - d^o|, S)\,dp$ to obtain the optimal 
depth interpolation. It measures how similar the plane $a_i x + b_i y + c_i$ is to the 
observed depth $d^o$ over the region $\Omega_i$ ($S$ is the set of locations of the 
observed data). In this term, the interpolation function $T(\cdot, S)$ uses the 
similarity value $|a_i x + b_i y + c_i - d^o|$ at the positions $S$ to estimate the 
missing depth at the positions $\bar{S} = \Omega \setminus S$. Since $T(\cdot, S)$ is 
aware of the structure of the color image $I$ during interpolation, it can be 
considered a joint bilateral filter similar to the work of [KCLU07]. The better the 
estimated plane parameters are, the closer the residual term is to zero. Experimental 
results show that the residual readily approaches zero under our interpolation 
strategy, since any scene object can be represented by several depth planes; even a 
curved surface can be approximated by several planes. 

4. Smooth term $s_i$ and data term $t_i(S)$: 

We want all the interpolated depth values to be realistic and natural. Therefore, a 
data term and a smooth term are further introduced to measure how well the 
interpolated depth value $d$ matches the plane $a_i x + b_i y + c_i$ and the observed 
depth $d^o$. We define the smooth term as 

$$s_i = s_{i1}(S) + s_{i2}(\bar{S}) = \int_{S} |\partial_x d - a_i| + |\partial_y d - b_i|\,dp + \int_{\bar{S}} |\partial_x d - a_i| + |\partial_y d - b_i|\,dp$$

and the data term $t_i(S)$ as an integral over $S$ that penalizes the deviation of 
$d$ from the observed depth $d^o$. In these definitions, $s_{i1}(S)$ encourages the 
interpolated depths on $S$ to form a planar surface, and $s_{i2}(\bar{S})$ helps to 
propagate the interpolated depths from $S$ to $\bar{S}$. 

Finally, we can calculate all the parameters $u_i$, $a_i$, $b_i$, $c_i$ and $d$ by 
minimizing our objective function, Equation (5.1): 

$$f_i = f_i^o(S) + \zeta s_{i2}(\bar{S}) = (\beta r_i + \zeta s_{i1} + \tau t_i)(S) + \zeta s_{i2}(\bar{S})$$

$$\min_{u_i, a_i, b_i, c_i, d} \sum_{i=1}^{L} \left\{ \int_\Omega g|\nabla u_i| + \gamma\|u_i\|_\infty + f_i \right\} \quad \text{s.t.} \quad \sum_{i=1}^{L} u_i = 1,\; u_i \in \{0, 1\} \qquad (5.1)$$


5.3.2 Optimization 

A convex optimization problem requires both a convex feasible set and a convex 
objective function. In Equation (5.1), $u_i$ takes values in the discrete set 
$\{0, 1\}$, which is not convex, so minimizing our objective function is a difficult 
non-convex optimization problem. However, if we relax the range of $u_i$ to [0, 1], 
the model becomes convex with respect to $u_i$. We propose a new optimization 
strategy that uses the alternating direction method proposed by Boyd et al. [BPC11] to 
split the problem into two sub-problems (Equations (5.2) and (5.3)). The first 
sub-problem, Equation (5.2), is used to calculate the segments $u^k$ for known 
$a^{k-1}$, $b^{k-1}$, $c^{k-1}$ and $d^{k-1}$: 

$$u^k = \arg\min_{u_i} \sum_{i=1}^{L} \left\{ \int_\Omega g|\nabla u_i| + \gamma\|u_i\|_\infty + f_i^{k-1} \right\} \quad \text{s.t.} \quad \sum_{i=1}^{L} u_i = 1,\; u_i \geq 0 \qquad (5.2)$$

The second sub-problem, Equation (5.3), is used to calculate $a^k$, $b^k$, $c^k$ and 
$d^k$ for known $u^k$, where $k$ denotes the iteration number: 

$$a^k, b^k, c^k, d^k = \arg\min_{a_i, b_i, c_i, d} \sum_{i=1}^{L} \left\{ f_i^o(S) + \zeta s_{i2}(\bar{S}) \right\} \qquad (5.3)$$


For efficiency, we need to reduce the cost of the iterative calculations of 
Equations (5.2) and (5.3). In fact, only the result of the last step of Equation (5.3) 
is taken as our final result, so it is necessary to improve the efficiency of the 
previous steps. The terms $f_i^o(S)$ and $s_{i2}(\bar{S})$ in Equation (5.3) are 
computed on two disjoint regions, $S$ and $\bar{S}$. Since the depth values on $S$ are 
constrained by the observed data while the values on $\bar{S}$ are unconstrained, the 
value of $f_i^o(S)$ must be much larger than that of $s_{i2}(\bar{S})$. In addition, 
$\mathrm{Per}(u_i)$ and $r_i(S)$, which are calculated on the region $S$, are both 
based on the structure of the color image, so the depth information obtained there is 
more accurate than on the region $\bar{S}$. Clearly, we can improve the efficiency of 
the depth computation by simply removing $s_{i2}$ from Equations (5.2) and (5.3) 
without any adverse effect. Furthermore, the summation in Equation (5.3) can be 
dropped because $f_i^o$ and $f_j^o$ are independent for $i \neq j$. 


In summary, we calculate the minimum solution of Equation (5.1) by iteratively 
performing the alternating direction method [BPC11], minimizing Equations (5.2) and 
(5.3) in turn. Specifically, we first apply Equation (5.2) to iteratively split the 
image domain into different segments based on the partial depth $d^o$ on $S$ and the 
registered guide image $I$. Finally, Equation (5.3) uses the segments and the 
associated partial depths to achieve depth map interpolation. 
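The role of the planar model in the second sub-problem can be illustrated with a strongly simplified stand-in: for a fixed segmentation, fit one plane per segment to its seed depths and evaluate the plane at the missing pixels. The sketch below uses an ordinary least-squares fit instead of the variational terms of Equation (5.3), so it is an assumption-laden illustration rather than the proposed optimizer; all names are hypothetical, and every segment is assumed to contain at least three non-collinear seeds.

```python
from typing import Dict, List, Sequence, Tuple


def det3(m: Sequence[Sequence[float]]) -> float:
    """Determinant of a 3x3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_plane(samples: Sequence[Tuple[float, float, float]]) -> Tuple[float, ...]:
    """Least-squares fit of d = a*x + b*y + c to (x, y, d) samples."""
    sxx = sxy = sx = syy = sy = n = sxd = syd = sd = 0.0
    for x, y, d in samples:
        sxx += x * x; sxy += x * y; sx += x
        syy += y * y; sy += y; n += 1.0
        sxd += x * d; syd += y * d; sd += d
    m = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    rhs = [sxd, syd, sd]
    det = det3(m)
    if abs(det) < 1e-12:
        raise ValueError("seeds are too few or collinear")
    solution = []
    for col in range(3):               # Cramer's rule on the normal equations
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = rhs[r]
        solution.append(det3(mc) / det)
    return tuple(solution)


def interpolate_segments(labels: Sequence[Sequence[int]],
                         seeds: Dict[Tuple[int, int], float]):
    """Fill missing depths with the plane fitted to the seeds of each segment."""
    by_segment: Dict[int, List[Tuple[float, float, float]]] = {}
    for (x, y), d in seeds.items():
        by_segment.setdefault(labels[y][x], []).append((x, y, d))
    planes = {seg: fit_plane(s) for seg, s in by_segment.items()}
    depth = []
    for y, row in enumerate(labels):
        out_row = []
        for x, seg in enumerate(row):
            if (x, y) in seeds:        # keep observed depths unchanged
                out_row.append(seeds[(x, y)])
            else:
                a, b, c = planes[seg]
                out_row.append(a * x + b * y + c)
        depth.append(out_row)
    return depth, planes
```

Because each segment is filled from its own plane only, depth never leaks across a segment boundary, which is the mechanism by which the proposed method avoids edge blur.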


5.4 Experimental Results and Evaluation 


The whole algorithm is implemented in MATLAB. After extensive experiments, we 
determined the parameters $\gamma$, $\zeta$, $\beta$, $\tau$ of Equation (5.1) to lie 
in the ranges [10, 100], [5, 20], [5, 15] and [20, 50], respectively. We found that 
the quality of depth up-sampling within these ranges is remarkable. For ease of 
evaluation, we used fixed parameters ($\gamma = 30$, $\zeta = 10$, $\beta = 10$, 
$\tau = 30$). 

Figure 5.1 compares the ability of our algorithm to recover and segment noisy images 
with that of current state-of-the-art denoising methods [SSD09, TMM98, FFLS08] and 
general segmentation algorithms (K-means, Mean Shift and Normalized Cuts). To 
simplify the evaluation, we used the special case in which every pixel in the image is 
a seed (i.e., $\Omega = S$). 

(a) Input (b) EPD result (c) BLF result (d) WLS result (e) Our denoising 
(f) Visualization (g) K-Means result (h) Mean shift result (i) Ncut result (j) Our partition 

Figure 5.1 Comparison Between Current Related Methods and Our Algorithm for Noisy Images 


In our tests, the noisy grayscale image shown in Figure 5.1(a) was used directly for 
recovery and segmentation. In order to show the results more clearly, we have added 
color to all images. Figure 5.1(e) and Figure 5.1(j) show the denoising and 
segmentation results of our algorithm, respectively. The denoising results are 
compared in the first row of Figure 5.1. It can be seen that the EPD [SSD09], BLF 
[TMM98] and WLS [FFLS08] methods suffer from texture copying. The second row of 
Figure 5.1 shows the results of the three general segmentation algorithms, whose 
segmentation accuracy is degraded by the noise. In contrast, our algorithm performs 
better than the other related algorithms for both segmentation and denoising. 

Our method can handle different kinds of depth map interpolation tasks, such as 
random missing data, structural missing data and up-sampling. In these cases, the 
seed set S is a subset of the image domain $\Omega$. To demonstrate the performance of 
our algorithm on these tasks, we compare the interpolation performance on depth maps 
with structural missing data and Poisson noise, 5% random missing data and 8X 
up-sampling (see Figure 5.2 and Figure 5.3). The input images are taken from the 
Middlebury Stereo Dataset [SS02] and the RGB-D Object Dataset [LBRF11]. 

Figure 5.2 Performance of Interpolation and Segmentation of Our Method 

Notes: From left to right: ground-truth images, guidance images, synthesized noisy 
images, our restored depths, our partition results, our depth interpolation results. 


(a) Our partition result (b) Ground truth (c) Our up-sample result 
(d) [LTT13] (e) [YYLH12] (f) [PKT11] 


Figure 5.3 Comparison of Up-Sample Results for Noised Image 
Notes: We compare our algorithm with two global methods AR[YYLH12], MRF[PKT11] and a 


local method GF[LTT13], and only ours can eliminate edge blurring and texture copying artifacts. 


The experimental results (see Figure 5.3) clearly show that the performance of our 
algorithm is very stable and is not affected by noise. The interpolated image 
basically restores the quality of the original sharp image. The second row of 
Figure 5.3 shows the depth map captured by the Kinect depth sensor with very low 
output accuracy. Because our model uses a multi-plane strategy to describe the entire 
depth map, the interpolation results can exceed the quality of the original Kinect 
output. To further demonstrate the advantages of the proposed depth map up-sampling 
algorithm, we compare it with two common global methods (AR [YYLH12], MRF [PKT11]) 
and the local method GF [LTT13]. It can be clearly seen (see Figure 5.3) that the 
other three algorithms exhibit more or less edge blur and texture copying in the 8X 
depth up-sampling task. In contrast, only our algorithm avoids these problems while 
ensuring the accuracy of up-sampling. 


5.5 Conclusion 


In this paper, we have proposed a variational method for joint depth map 
interpolation and segmentation. By making full use of the characteristics of 
interpolation and segmentation, we use the edge information of the segmented 
regions for depth interpolation to avoid blurred edges. At the same time, we use the 
obtained depth values to eliminate the influence of noise during partitioning and 
thus filter the image. In addition, our algorithm re-describes the structure of the 
depth map through multiple planes. This design effectively avoids texture copying and 
edge blur artifacts and also removes noise from the image. Experiments have shown 
that our method provides better results for different types of interpolation. 
Moreover, it effectively repairs missing depth information and enhances the visual 
quality of our final depth images from different viewpoints, which also significantly 
improves the modelling accuracy of our multi-view based 3D reconstruction system. 


6.1 Work Summary 


Taking multi-view based 3D reconstruction of large-scale and complex scenes 
as our main target, this thesis studies the related supporting technologies 
separately to enhance the overall performance. Our research can be divided into four 
sub-works: accurate blind deblurring using a SalientPatch-based prior, hybrid tree 
guided PatchMatch and quantizing acceleration for multi-view disparity estimation, 
joint depth map interpolation and segmentation with a planar surface model, and 
real-time pedestrian detection via hierarchical convolutional features. Through the 
cooperation of these related works, we finally achieved a more robust and faster 
multi-view based 3D reconstruction algorithm, which can improve the performance 
of current practical applications. 

In our work, we first apply a novel blind deblurring algorithm using a SalientPatch-based
prior to the given multi-view images. This pre-processing step for stereo matching
reduces adverse effects in the input (such as camera motion, noise and overexposure).
Then, a hybrid tree guided PatchMatch stereo matching algorithm with a novel quantizing
acceleration strategy is proposed to achieve fast depth estimation with sub-pixel
accuracy for multi-view image sequences. It substantially relieves the imbalance between
accuracy and efficiency in stereo matching. To further enhance the quality of the
estimated depth images, we also designed a joint depth image segmentation and
interpolation algorithm as post-processing for stereo matching. Furthermore, the
performance of multi-view based 3D reconstruction for large-scale and complex scenes is
often degraded by interference from moving pedestrians. To address this, we proposed a
real-time pedestrian detection algorithm via hierarchical convolutional features that
provides semantic information during stereo matching. It detects the positions of
pedestrians accurately and in real time, so that the stereo matching algorithm can skip
depth estimation in unnecessary areas (dynamic pedestrians), which improves the overall
reconstruction accuracy and operating efficiency. The experimental results demonstrate
that each sub-work performs better than the related state-of-the-art methods. Our
algorithm achieves both high accuracy and high efficiency in multi-view based 3D
reconstruction of large-scale and complex scenes, which makes it suitable for practical
3D reconstruction applications.

The main contributions and innovations of this study are as follows: 

1. Existing blind deblurring algorithms mainly follow one of two strategies to estimate
the blur kernel: 

(1) a full-image based method; 

(2) a patch based method. 

Blur kernel estimation based on the full image improves deblurring precision but
sacrifices efficiency. The patch-based approach, in contrast, narrows the range of the
blur kernel estimation and increases speed, but degrades the quality of the deblurring.
Current research therefore cannot achieve blind image deblurring that is both highly
efficient and highly precise. Our algorithm addresses this bottleneck through an in-depth
study of image features.

Our main contributions can be summarized as follows. We found that people judge whether a
deblurred image is clear mainly by whether the algorithm accurately restores the region
of interest. In other words, an image deblurring algorithm should pay more attention to
recovering foreground objects than the background. We therefore propose a blind
deblurring algorithm guided by SalientPatch. We generate the SalientPatch for blur kernel
estimation based on object probability, structural richness and local contrast. This
strategy not only greatly reduces the estimation range, but also brings our blur kernel
closer to the ground truth and focuses it on the region of interest of the image, thus
achieving faster and higher-quality image recovery than existing advanced algorithms. In
addition, the SalientPatch-based kernel estimation method is general and can be applied
within current maximum a posteriori (MAP) frameworks.

2. Neither the current global nor the current local stereo (or multiple-view) matching
algorithms deliver both matching accuracy and computational efficiency. This remains a
bottleneck for most current stereo matching methods. The main achievement of our research
is therefore to break this trade-off between precision and speed, and to use the proposed
method to obtain relatively high performance in binocular and multiple-view settings for
large-scale 3D reconstruction applications. We proposed a hybrid tree-guided PatchMatch
and quantizing acceleration algorithm to obtain a dense and accurate disparity map at
high processing speed. First, an initial disparity map is generated by hybrid tree cost
aggregation to constrain the label search range. Then, a novel plane refinement strategy
for PatchMatch stereo accurately calculates the final disparity. Finally, we presented an
effective quantizing acceleration strategy for the matching cost computation of each
continuous disparity.

Our main contribution is the seamless combination of two independent algorithms (hybrid
tree cost aggregation and PatchMatch stereo), which supports MVS matching applications
while maintaining or even improving the solution quality. Furthermore, we solve the large
label space problem of PatchMatch: our quantizing acceleration strategy not only
significantly accelerates PatchMatch estimation but also improves the disparity accuracy
from pixel level to sub-pixel level. Experimental results on the Middlebury and KITTI
benchmarks show that our proposed algorithm generates high-quality depth images and
achieves better efficiency than the two original independent methods. In addition, the
algorithm is also suitable for multi-view reconstruction of real scenes.

3. The depth information generated by existing depth sensors or stereo matching
algorithms always suffers from problems such as noise and holes, which seriously affect
the accuracy and visualization of the 3D reconstruction. Depth map up-sampling techniques
have been proposed to restore low-resolution depth information. However, existing
algorithms tend to introduce edge blurring and texture copying into the depth map, so the
recovered depth information is inaccurate. To solve this problem, we propose a joint
processing model based on image segmentation and depth interpolation for depth map
up-sampling.

Our main contributions in the study can be summarized as follows. We segment the original
depth map into different regions based on the color image. For each interpolated pixel,
we use only the seed pixels in its own partition to interpolate its depth. In this way,
we solve the edge blur problem. At the same time, we transform the depth map into
multiple slanted planes and use this planar assumption to estimate the missing depth
values in each segment. This approach makes the restored depth map more natural and
avoids the problem of texture copying.

4. Most object detectors achieve pedestrian detection through a variety of hand-crafted
features or integrated deep learning techniques. At present, the mainstream CNN-based
pedestrian detection algorithms obtain better detection results than the traditional
methods. However, experiments have shown that many problems remain. For example, existing
pedestrian detection algorithms cannot accurately recognize pedestrians at different
scales, especially small scales. Existing algorithms also have difficulty handling severe
occlusion. These problems seriously hinder the development of pedestrian detection
technology. At the same time, because of their high algorithmic complexity, existing
pedestrian detection algorithms do not achieve a balance between detection accuracy and
computational efficiency. To solve the above problems, we proposed a pedestrian detection
algorithm based on a multi-layer convolutional feature pyramid, which not only provides
high-precision pedestrian detection but also runs in real time.

Our main contributions in the study can be summarized as follows. We proposed a novel
fully convolutional network in which the lower-level feature maps are used to detect
small-scale pedestrians and the higher-level feature maps are used for large scales. This
strategy effectively improves the accuracy of multi-scale pedestrian detection by fully
considering the characteristics of the feature maps from different layers. In the
training phase, we also designed a predictive box constrained by pedestrian shape
statistics. This predictive box effectively accelerates detection and significantly
reduces the false detection rate. Finally, we further simplify the SSD loss function to
avoid the impact of the classifier on pedestrian detection performance. A large number of
experiments have shown that the overall performance of our algorithm significantly
exceeds that of the existing top 10 pedestrian detection algorithms.


6.2 Outlook of Future Work 


For multi-view based 3D reconstruction, stereo matching is the core technique of the
entire system. It draws on multiple disciplines such as computer vision, computer
graphics, pattern recognition and projective geometry. With the continuous development of
the artificial intelligence industry, intelligent applications such as autonomous
driving, robotics, virtual reality and augmented reality will create broad demand for 3D
reconstruction. As the technical foundation of 3D reconstruction, stereo matching is
therefore likely to remain an important and active research topic over the next decade.
In our work, the stereo matching algorithm achieves better accuracy than the related
algorithms. However, the object details and edges in the estimated depth images are still
not as good as those of the best current global stereo matching algorithms, which
noticeably affects the visual quality of our 3D models. Although our stereo matching
algorithm significantly improves the calculation speed, it still cannot achieve real-time
computation for large-scale scenes. Our stereo matching technology thus still has
shortcomings that are left to future work, and stereo matching in general remains some
distance from commercialization and productization. Based on the above analysis, the main
directions of our future research can be summarized as follows: 

1. More accurate depth estimation 

Rebuilding a complete three-dimensional model from multi-view images is still a difficult
problem for computer vision, and the existing methods have many limitations. In fact, for
some applications (such as image interpolation, depth map based segmentation and video
editing), complete 3D reconstruction is not necessary; only a high-quality recovered
depth image for each input image is required. The key is therefore to ensure that these
depth images are as flawless as possible at discontinuous boundaries while maintaining
good temporal and spatial consistency. The overall computational scale studied in this
paper is very small: stereo matching is mainly aimed at image pairs or sparse multi-view
images. Moreover, our current algorithm only considers pixel color consistency. Although
it can effectively correct erroneous depth estimates caused by noise and occlusion, it
does not recover some detailed structures and object contours well. In the future, we
plan to consider both the color and the geometric consistency of the objects. Visibility
and reconstruction noise will be modeled statistically, so that factors such as noise,
occlusion and outliers are handled within a unified framework. In addition, we plan to
apply a global stereo matching algorithm (such as BP optimization) to guarantee and
improve the accuracy of depth estimation. For multi-view 3D reconstruction, our research
does not yet introduce cluster optimization, which also limits the accuracy of our
multi-view depth estimation. In the future, we hope to add a clustering optimization
method to the new stereo matching algorithm to further improve the temporal and spatial
consistency among multi-view images.

2. Depth estimation based on deep learning 

Deep learning is a relatively new field within machine learning. Its motivation lies in
building neural networks that simulate the human brain for analytical learning, imitating
the brain's mechanisms to interpret data such as images, sounds and text. With the
development of deep learning technologies, more and more traditional algorithms have
begun to be replaced by deep learning based methods. For stereo matching, most current
research focuses on convolutional neural network based monocular depth estimation. By
learning from a large amount of training data, deep learning based stereo matching
methods have begun to surpass the traditional methods in both accuracy and speed.
However, the main drawback of existing methods is that the generated depth maps are too
smooth and lack structural details. In future research, we plan to incorporate the
geometric theory of binocular and multi-view stereo matching into network learning to
further improve the computation speed of depth estimation. At the same time, the
convolutional neural network will be extended to binocular and multi-view 3D
reconstruction.

3. More intelligent application research 

Solving practical application problems is the ultimate goal of all research work, which
means combining existing 3D reconstruction methods with specific application problems in
different fields. This requires studying the key technologies and basic theories relevant
to each application. At the same time, we need to draw on knowledge from the related
fields to design and improve algorithms for specific application problems. Possible
applications include smart city planning, crime scene reconstruction and analysis, plant
modeling and simulation, smart driving and so on. All of these application studies
require large-scale 3D reconstruction techniques.
Multi-view Epipolar Geometry 


In this work, all three-dimensional reconstructions are based on using a single camera to
obtain multiple-view images of the same scene or object (as shown in Figure A.1). Through
SfM technology we can directly obtain the internal and external parameters of the camera.
For convenience of explanation, all transformation relations are expressed in matrix and
vector form. The camera internal parameter matrix K is:


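$$
K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
\tag{A.1}
$$

The external parameters, the rotation matrix R and the translation vector T, are:

$$
R = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix}_{3 \times 3},
\qquad
T = \begin{bmatrix} t_x \\ t_y \\ t_z \end{bmatrix}_{3 \times 1}
\tag{A.2}
$$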


We assume that a single camera is used to photograph a space point $X$ from two different
views. Let $x_1$ denote the coordinates of $X$ in the camera frame of view A, with
corresponding camera parameters $R_1$ and $T_1$, and let $x_2$ denote its coordinates in
the camera frame of view B, with corresponding camera parameters $R_2$ and $T_2$. The
following relationships hold:

$$
x_1 = R_1 X + T_1, \qquad x_2 = R_2 X + T_2
\tag{A.3}
$$

Eliminating $X$, we obtain the relationship between the two camera coordinates:

$$
x_1 = (R_1 R_2^{-1})\, x_2 + (T_1 - R_1 R_2^{-1} T_2)
\tag{A.4}
$$

It can be simplified as:

$$
x_1 = R_{BA}\, x_2 + T_{BA}
\tag{A.5}
$$

Therefore, the camera relationship from view B to view A can be expressed as:

$$
R_{BA} = R_1 R_2^{-1}, \qquad T_{BA} = T_1 - R_1 R_2^{-1} T_2
\tag{A.6}
$$
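
The relative pose in (A.6) can be checked numerically. The Python sketch below builds two
hypothetical camera poses, maps one space point into both views with (A.3), and confirms
that (A.5) reproduces the view A coordinates from the view B coordinates. The helper
names `rotation_z` and `relative_pose` and all numeric values are assumptions chosen for
illustration; they do not come from the experiments in this work.

```python
import numpy as np

def rotation_z(angle_rad):
    """Rotation about the z-axis (used only to build example poses)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def relative_pose(R1, T1, R2, T2):
    """Return (R_BA, T_BA) such that x1 = R_BA @ x2 + T_BA, following (A.6)."""
    R_BA = R1 @ R2.T           # R2 is orthogonal, so its inverse is its transpose
    T_BA = T1 - R_BA @ T2
    return R_BA, T_BA

# Hypothetical poses of the same camera in view A and view B.
R1, T1 = rotation_z(0.10), np.array([0.0, 0.0, 0.0])
R2, T2 = rotation_z(-0.20), np.array([-0.5, 0.1, 0.0])

X = np.array([0.3, -0.2, 4.0])   # a space point in world coordinates
x1 = R1 @ X + T1                 # (A.3), coordinates in view A
x2 = R2 @ X + T2                 # (A.3), coordinates in view B

R_BA, T_BA = relative_pose(R1, T1, R2, T2)
residual = np.linalg.norm(x1 - (R_BA @ x2 + T_BA))   # should vanish by (A.5)
print(residual)
```

Using the transpose instead of a general matrix inverse relies on $R_2$ being a proper
rotation; for poses estimated by SfM that are only approximately orthogonal, a general
inverse would be the safer choice.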


According to the theory of epipolar geometry (see Figure A.1), the point $p_r$ in view B
that corresponds to a point $p_l$ in view A must lie on the associated epipolar line $l'$
in view B. To perform multi-view stereo matching, we therefore need a relationship that
gives the epipolar line $l'$ in view B for any point $p_l$ in view A. We can then find
the corresponding point of $P$ in view B by searching along this epipolar line.


Figure A.1 Epipolar Geometry 


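Let $p_l$ be the position of the object point $P$ in the $O_A$ camera coordinate system
and $p_r$ be its position in the $O_B$ camera coordinate system. The rotation and
translation of $O_B$ relative to $O_A$ are $R_{BA}$ and $T_{BA}$, so we get:

$$
p_l = R_{BA}\, p_r + T_{BA}
\tag{A.7}
$$

This can be written as $(p_l - T_{BA}) = R_{BA}\, p_r$ and, since $R_{BA}$ is an
orthogonal matrix, equivalently as $p_r = R_{BA}^{T} (p_l - T_{BA})$. As shown in Figure
A.1, the three vectors $\overrightarrow{O_A P}$, $\overrightarrow{O_B P}$ and
$\overrightarrow{O_A O_B}$ are coplanar, which means
$\overrightarrow{O_B P} \cdot (\overrightarrow{O_A O_B} \times \overrightarrow{O_A P}) = 0$.
Expressing each vector in the coordinates of the left view (view A), we get:

$$
\overrightarrow{O_A P} = p_l, \qquad
\overrightarrow{O_A O_B} = T_{BA}, \qquad
\overrightarrow{O_B P} = p_l - T_{BA} = R_{BA}\, p_r
\tag{A.8}
$$

Then their mixed product can be expressed as:

$$
\begin{aligned}
(p_l - T_{BA})^{T} \, (T_{BA} \times p_l) &= 0 \\
(R_{BA}\, p_r)^{T} \, (T_{BA} \times p_l) &= 0 \\
p_r^{T} R_{BA}^{T} \, (T_{BA} \times p_l) &= 0
\end{aligned}
\tag{A.9}
$$

With $T_{BA} = (T_x, T_y, T_z)^{T}$ and $p_l = (p_x, p_y, p_z)^{T}$, the cross product
is:

$$
T_{BA} \times p_l = (T_y p_z - T_z p_y)\, i + (T_z p_x - T_x p_z)\, j + (T_x p_y - T_y p_x)\, k
\tag{A.10}
$$

whose matrix form is:

$$
T_{BA} \times p_l =
\begin{bmatrix} 0 & -T_z & T_y \\ T_z & 0 & -T_x \\ -T_y & T_x & 0 \end{bmatrix}
\begin{bmatrix} p_x \\ p_y \\ p_z \end{bmatrix}
$$

Let
$S = \begin{bmatrix} 0 & -T_z & T_y \\ T_z & 0 & -T_x \\ -T_y & T_x & 0 \end{bmatrix}$.
Then we have $p_r^{T} R_{BA}^{T} S\, p_l = 0$.

Let $E = R_{BA}^{T} S$. The equation can be expressed as:

$$
p_r^{T} E\, p_l = 0
\tag{A.11}
$$

Here $p_l$ and $p_r$ are positions in the unit-distance (normalized) coordinate system.
To analyze the image, we need to convert them into the pixel coordinate system. The
camera internal matrix K is known and the two views are taken with the same camera, so
the pixel coordinates $P_l$ and $P_r$ satisfy:

$$
P_l = K p_l, \qquad P_r = K p_r, \qquad p_l = K^{-1} P_l, \qquad p_r = K^{-1} P_r
\tag{A.12}
$$

Merging Equation (A.11) with (A.12) gives:

$$
\begin{aligned}
(K^{-1} P_r)^{T} E\, (K^{-1} P_l) &= 0 \\
P_r^{T} (K^{-T} E K^{-1})\, P_l &= 0 \\
P_r^{T} F\, P_l &= 0 \\
F &= K^{-T} R_{BA}^{T} S K^{-1}
\end{aligned}
\tag{A.13}
$$

Assume that $P_l = (x_l, y_l, 1)^{T}$ and $F$ are known; then $P_r^{T} (F P_l) = 0$. The
linear constraint on its matching point can be expressed as:

$$
\begin{bmatrix} a \\ b \\ c \end{bmatrix} = F P_l =
\begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}^{-T}
R_{BA}^{T}
\begin{bmatrix} 0 & -T_z & T_y \\ T_z & 0 & -T_x \\ -T_y & T_x & 0 \end{bmatrix}
\begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}^{-1}
\begin{bmatrix} x_l \\ y_l \\ 1 \end{bmatrix}
$$

The linear (epipolar line) constraint equation is $ax + by + c = 0$, where $(x, y)$ are
the pixel coordinates of a candidate matching point in view B.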
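
The epipolar line constraint can be verified numerically. The Python sketch below assumes
the convention $p_l = R_{BA}\, p_r + T_{BA}$ for the relative pose, under which the
constraint takes the form $P_r^{T} F P_l = 0$ with $F = K^{-T} R_{BA}^{T} S K^{-1}$. It
projects one hypothetical 3D point into both views and checks that the pixel in view B
lies on the line $(a, b, c)^{T} = F P_l$. The helper names `skew` and
`fundamental_matrix`, the intrinsics and the pose values are illustrative assumptions,
not values used in this work.

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix S such that S @ p equals np.cross(t, p)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz,  ty],
                     [ tz, 0.0, -tx],
                     [-ty,  tx, 0.0]])

def fundamental_matrix(K, R_BA, T_BA):
    """F with P_r^T F P_l = 0, assuming p_l = R_BA @ p_r + T_BA."""
    E = R_BA.T @ skew(T_BA)          # essential matrix under this convention
    K_inv = np.linalg.inv(K)
    return K_inv.T @ E @ K_inv

# Hypothetical intrinsics and relative pose (small rotation about the y-axis).
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
angle = 0.05
R_BA = np.array([[ np.cos(angle), 0.0, np.sin(angle)],
                 [ 0.0,           1.0, 0.0          ],
                 [-np.sin(angle), 0.0, np.cos(angle)]])
T_BA = np.array([0.4, 0.0, 0.02])

p_r = np.array([0.2, -0.1, 5.0])     # point P in the view B camera frame
p_l = R_BA @ p_r + T_BA              # the same point in the view A camera frame

P_l = K @ (p_l / p_l[2])             # homogeneous pixel coordinates in view A
P_r = K @ (p_r / p_r[2])             # homogeneous pixel coordinates in view B

F = fundamental_matrix(K, R_BA, T_BA)
a, b, c = F @ P_l                    # epipolar line in view B for pixel P_l
distance_px = abs(a * P_r[0] + b * P_r[1] + c) / np.hypot(a, b)
print(distance_px)                   # point-to-line distance in pixels
```

In a matching pipeline, candidate pixels in view B would be sampled along the line
$ax + by + c = 0$ rather than over the whole image, which is what reduces the search from
two dimensions to one.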